New approaches are revolutionizing the market for embedded AI. A massive wave of innovation is pushing the boundaries of performance, security, and efficiency. The integration of specialized AI units and the emergence of edge-capable AI SoCs are bringing more intelligence into silicon.
Development workflow with Renesas e² studio: With Renesas e² studio, IoT applications can be downloaded from GitHub, configured, developed, and deployed on RA, RX, or RZ controllers – including connectivity to AWS and Microsoft Azure.
(Image: Renesas Electronics Corporation)
AI logic is moving from the cloud and central data centers directly into the chips of end devices, where the data originates. Application scenarios like robotics require neural networks and decision logic to operate in real time, with minimal latency, high data protection, and often without any connection to the cloud. This decentralization is changing both the performance requirements and the security and energy profiles of the systems.
What was considered a high-end feature just a few years ago is becoming standard in microcontrollers, SoCs, and edge processors. The latest generations of microcontrollers combine classical signal processing with dedicated neural processing units (NPUs) or ML accelerators.
This integration is fundamentally transforming embedded development: classical control logic is being replaced by adaptive systems that learn, recognize, and decide. Manufacturers like STMicroelectronics NV, Renesas Electronics Corporation, and NXP Semiconductors are merging classical control electronics with neural computing logic—supported by specialized NPUs and machine learning accelerators. As a result, even energy-efficient sensor nodes or wearables can excel with local pattern recognition and predictive maintenance.
Several embedded AI cores stand out as technologically groundbreaking. Three current platforms exemplify where embedded development is headed.
Three-Layer Cake: AI in the Chip
With the STM32N6 series, STMicroelectronics introduces AI acceleration to the classic microcontroller segment for the first time. Based on the Arm Cortex-M55 core and enhanced with the in-house Neural-ART Accelerator, the chip achieves up to 600 giga-operations per second with a power consumption of only a few hundred milliwatts.
An STM32N6 Nucleo: Macronix and STMicroelectronics formed a strategic partnership to equip the AI-accelerated STM32N6 MCU platform with OctaFlash memory.
(Image: r/stm32 via Reddit)
This performance is made possible by a combination of deeply pipelined multiply-accumulate structures, optimized data paths, and a memory architecture that executes inference operations almost without external memory access.
The compute units operate with a high degree of parallelism, while the internal SRAM cluster supplies data in burst cycles to avoid latency stalls.
The Neural-ART block uses quantized INT8 and, in part, binary operations to reduce the energy consumption per operation. Short signal paths, minimized capacitances, and finely tuned clock-gating logic enhance efficiency. By eliminating typical loss sources of classical digital designs, even complex tasks such as audio recognition, object detection, or gesture control become executable locally, in devices that previously had barely more computing power than a smartwatch.
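To make the effect of quantization concrete, here is a minimal C sketch of the INT8 multiply-accumulate that such an NPU parallelizes in hardware. The function names and the affine requantization scheme are illustrative, not ST's actual kernel code.

```c
#include <stdint.h>
#include <stddef.h>

/* INT8 dot product: the elementary operation an NPU parallelizes
 * across many MAC units. A 32-bit accumulator avoids overflow. */
int32_t dot_int8(const int8_t *w, const int8_t *x, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)w[i] * (int32_t)x[i];
    return acc;
}

/* Requantize the wide accumulator back to INT8 for the next layer.
 * In silicon this is a fixed-point multiply and shift; the float
 * scale here is illustrative and just keeps the example short. */
int8_t requantize(int32_t acc, float scale, int8_t zero_point)
{
    int32_t q = (int32_t)(acc * scale) + zero_point;
    if (q > 127)  q = 127;     /* saturate to the INT8 range */
    if (q < -128) q = -128;
    return (int8_t)q;
}
```

Because every operand is one byte instead of four, data movement, multiplier area, and switching energy per operation all shrink accordingly.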
The STM32N6 family demonstrates that AI logic can now be miniaturized to the extent that it fits directly into classic MCU designs—a paradigm shift for developers who previously strictly separated control and inference.
One level above, Renesas positions itself with the RZ/V2H, a highly integrated SoC for industrial applications, robotics, and edge vision. The processor combines four Cortex-A55 cores (Linux layer), two Cortex-R8 cores for deterministic real-time control loops, and a Cortex-M33 for low-level peripherals as well as safety and housekeeping tasks.
The design is consistently optimized for minimizing data movement. At its core is the DRP-AI3 accelerator, a dynamically reconfigurable processor architecture. Convolution, activation, pooling, and element-wise cores are electronically switched and dynamically combined at runtime. This reduces silicon area and leakage currents, but above all, external memory traffic, as the operator chains are executed as data flows tightly linked to local scratchpads.
The RZ/V2H provides a separate DRP/DRP-AI bus alongside the ACPU, RCPU, and MCPU buses; 6 MB of on-chip SRAM serve as a multi-level tiling/scratchpad buffer for feature maps and weights, while a dedicated DMA engine bursts tiles in for loading. Sparsity processing in the DRP-AI3 suppresses zero operations already on the control path, so the MAC arrays clock only for "useful" elements. Combined with INT8 quantization and aggressive clock and power gating in fine-grained power islands, the system achieves high TOPS-per-watt efficiency.
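The following hedged C sketch illustrates the tiling and zero-skipping idea in software terms: a DMA stand-in loads one tile into scratchpad SRAM, and the inner loop issues MACs only for non-zero weights. All names, sizes, and the memcpy-based DMA stub are invented for illustration and do not reflect Renesas' actual DRP-AI3 interfaces.

```c
#include <stdint.h>
#include <string.h>

#define TILE 64   /* illustrative tile edge; real tiling is layer-dependent */

static int8_t  scratch_in[TILE][TILE];   /* feature-map tile in SRAM   */
static int8_t  scratch_w[TILE];          /* weights, assumed preloaded */
static int32_t scratch_out[TILE];

/* DMA stand-in: on the real chip a descriptor-driven engine bursts the
 * tile from external LPDDR4 into the scratchpad. */
static void dma_load_tile(int8_t dst[TILE][TILE], const int8_t *dram, int idx)
{
    memcpy(dst, dram + (size_t)idx * TILE * TILE, (size_t)TILE * TILE);
}

/* One tile of a convolution-like operator with zero-skipping: MACs are
 * issued only for non-zero weights, mirroring in software what the
 * sparsity logic does on the control path. */
static void conv_tile_sparse(void)
{
    for (int r = 0; r < TILE; r++) {
        int32_t acc = 0;
        for (int k = 0; k < TILE; k++) {
            if (scratch_w[k] == 0)
                continue;            /* skip: no MAC, no switching */
            acc += (int32_t)scratch_w[k] * (int32_t)scratch_in[r][k];
        }
        scratch_out[r] = acc;
    }
}

void run_tile(const int8_t *dram_feature_map, int tile_idx)
{
    dma_load_tile(scratch_in, dram_feature_map, tile_idx);
    conv_tile_sparse();              /* results stay in scratch_out */
}
```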
For vision pipelines, the data path is kept short: MIPI CSI-2 cameras feed through the video bus into the Mali-C55 ISP, which handles demosaicing and noise reduction; the DRP then takes over classic image operators (OpenCV equivalents) before the DRP-AI3 performs CNN/Transformer inference. This reduces round trips to external LPDDR4/4X DRAM and avoids bandwidth and latency peaks.
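Conceptually, the hand-offs correspond to the stage sequence below, shown as a hedged C sketch with stub functions. On the real chip these stages are hardware blocks wired over the video and DRP buses, not software calls; all stage names are hypothetical stand-ins.

```c
#include <stdio.h>

typedef struct { int id; } frame_t;

static frame_t csi2_capture(void)          { return (frame_t){ .id = 0 }; }
static void isp_stage(frame_t *f)          { (void)f; /* demosaic, denoise */ }
static void drp_stage(frame_t *f)          { (void)f; /* classic operators */ }
static void drpai_infer(const frame_t *f)  { printf("infer frame %d\n", f->id); }

int main(void)
{
    frame_t f = csi2_capture();   /* MIPI CSI-2 camera                */
    isp_stage(&f);                /* Mali-C55 ISP on the video bus    */
    drp_stage(&f);                /* OpenCV-style preprocessing       */
    drpai_infer(&f);              /* CNN/Transformer on the DRP-AI3   */
    return 0;
}
```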
Mechanisms such as deterministic R8 side paths (safety-critical, watchdogs, motor control) and the logical separation of bus domains improve temporal predictability compared to GPU-centric designs.
From a manufacturing perspective, Renesas relies on a highly integrated mixed-signal SoC topology with short interconnect lengths and low-capacitance local wiring to reduce energy per transmitted bit. Multiple voltage and clock domains enable DVFS per subsystem (A-cluster, R-cluster, DRP, DRP-AI, video path); critical nets use buffer insertion and targeted shielding to reduce jitter.
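From the firmware side, per-subsystem DVFS can be pictured as each domain picking its own operating point based on load. The C sketch below uses invented domain names, operating-point tables, and thresholds rather than the RZ/V2H's actual ones.

```c
#include <stdint.h>

typedef enum { DOM_A55, DOM_R8, DOM_DRP, DOM_DRPAI, DOM_VIDEO, DOM_COUNT } domain_t;

typedef struct { uint32_t freq_mhz; uint32_t mv; } opp_t;

/* Illustrative operating points, not real OPP-table values. */
static const opp_t opp_table[] = {
    { 200,  750 },   /* idle / housekeeping      */
    { 600,  850 },   /* steady-state processing  */
    { 1100, 950 },   /* burst, thermally limited */
};

static opp_t pick_opp(uint8_t load_pct)
{
    if (load_pct < 20) return opp_table[0];
    if (load_pct < 75) return opp_table[1];
    return opp_table[2];
}

/* Platform hook: a real BSP would program PLL dividers and the PMIC
 * rail for the domain here. Stubbed for the sketch. */
static void apply_opp(domain_t d, opp_t o) { (void)d; (void)o; }

void dvfs_tick(const uint8_t load_pct[DOM_COUNT])
{
    for (int d = 0; d < DOM_COUNT; d++)
        apply_opp((domain_t)d, pick_opp(load_pct[d]));
}
```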
The focus is less on maximum peak frequency and more on high efficiency at moderate clock speeds and within a tight thermal budget—perfectly suited for fanless edge systems in robotics and industry. The architecture combines classical control logic and AI processing in a compact space, enabling immediate interpretation of visual/sensory data without a cloud roundtrip, resulting in lower latency, better determinism, and a clear efficiency gain in safety-critical applications.
The next development leap is marked by NXP with the i.MX 95 family, which shifts the concept of embedded AI into the high-end range. Up to six Arm Cortex-A55 cores, combined with GPU, ISP, and the integrated eIQ Neutron NPU, deliver computing power and neural processing at a level previously only achievable in industrial PCs. This platform supports not only classical image and speech processing but also the execution of compact generative AI models in inference mode, while meeting the high safety and real-time requirements of the automotive and industrial sectors. This creates a new device class between embedded systems and edge servers—powerful enough for AI inference on video streams or sensor data, yet compact enough for control cabinets or vehicles.
In the fall of 2025, Synaptics introduced its new Edge AI platform, an SoC for IoT and smart-device applications positioned above classic MCUs – similar to the NXP i.MX 95, but optimized more strongly for the consumer edge. The platform combines a multi-core CPU cluster with a dedicated NPU pipeline and an integrated ISP for image and audio data.
Particularly notable is the use of a hybrid memory architecture with shared L3 cache and low-power SRAM areas. The shared L3 cache can be dynamically allocated between CPU and NPU cores, prioritizing memory accesses and reducing latencies. Additionally, Synaptics implements staggered power gating, where parts of the SRAM remain in deep sleep mode while active blocks continue buffering data—significantly reducing leakage current compared to the fixed cache architecture of the i.MX 95. This substantially increases energy efficiency.
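The staggered gating can be pictured as banked SRAM in which only the banks holding the active working set stay powered. The following C sketch models that policy with an invented bank count and a stubbed power-control write; Synaptics has not published this interface.

```c
#include <stdbool.h>

#define NUM_BANKS 8   /* illustrative; the real bank count is unknown */

typedef enum { BANK_ACTIVE, BANK_RETENTION } bank_state_t;

static bank_state_t bank_state[NUM_BANKS];

/* In real hardware this would write a power-control register; banks in
 * retention keep their contents but draw only leakage-level current. */
static void set_bank_state(int bank, bank_state_t s)
{
    bank_state[bank] = s;
}

/* Called by the memory allocator: wake exactly the banks the active
 * working set needs; everything else drops into deep sleep. */
void sram_update(const bool bank_in_use[NUM_BANKS])
{
    for (int b = 0; b < NUM_BANKS; b++)
        set_bank_state(b, bank_in_use[b] ? BANK_ACTIVE : BANK_RETENTION);
}
```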
Proven in Practice and Domain-Specific
German industry is also actively shaping the path to embedded AI in silicon form, albeit with different priorities in chip design and the integration of neural hardware: the focus is on robust manufacturing, automotive qualification, and energy-efficient chip design rather than maximum TOPS figures.
Infineon is increasingly bringing AI-capable MCUs to the market with the AURIX TC4x series and the new PSoC Edge series. The architecture combines classical safety MCUs with integrated DSP blocks for neural inference from sensor data—primarily for automotive pattern recognition and deterministic controls. Particularly notable are separate signal paths for analog front-ends and AI computations, a hierarchical on-chip memory system for data locality, and an ultra-efficient power domain separation designed for automotive temperature ranges up to 175 °C (~347 °F).
Infineon uses proven 28 nm automotive CMOS processes with optimized metal layers to ensure low leakage currents and stable timing under voltage fluctuations.
Bosch pursues an application-specific approach: the company integrates neural networks directly into its automotive ASICs for radar, lidar, and camera sensors. These designs rely on analog preprocessing, followed by compact digital inference arrays specifically tailored to the characteristics of vehicle environments—such as temperature drift, vibration stress, and electromagnetic interference.
Bosch uses proprietary mixed-signal layouts that combine sensor technology, A/D converters, and AI logic on a single substrate. Particularly notable is the use of local memory clusters in the neural arrays: the data paths remain extremely short, reducing energy consumption per inference cycle by up to 70 percent.
Siemens positions itself, through its EDA subsidiary Mentor (now Siemens EDA) and tools like Catapult AI, as a partner for automated high-level synthesis of energy-efficient neural hardware. The focus is on co-optimizing circuit design, timing, and energy profile so that silicon IP can be validated for external foundries. This ensures that AI blocks are assessed for silicon efficiency as early as the design phase – a crucial contribution for European semiconductor manufacturers relying on compact, validated AI IP cores.
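As a rough illustration of what high-level synthesis starts from, the loop nest below is the kind of plain C a Catapult-class tool can unroll and pipeline into a MAC datapath. It is a generic sketch with invented dimensions, not actual Catapult AI input code.

```c
#include <stdint.h>

#define IN_DIM  64
#define OUT_DIM 16

/* A dense (fully connected) layer as a synthesizable loop nest: an HLS
 * tool would typically pipeline the outer loop and unroll the inner
 * one, mapping the accumulation onto parallel MAC hardware. */
void dense_layer(const int8_t in[IN_DIM],
                 const int8_t weights[OUT_DIM][IN_DIM],
                 int32_t out[OUT_DIM])
{
    for (int o = 0; o < OUT_DIM; o++) {        /* candidate: pipeline */
        int32_t acc = 0;
        for (int i = 0; i < IN_DIM; i++)       /* candidate: unroll   */
            acc += (int32_t)weights[o][i] * (int32_t)in[i];
        out[o] = acc;
    }
}
```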
Siemens is working closely with research partners and European foundries to bridge the gap between EDA design and AI hardware.
Compared to the STM32N6, RZ/V2H, and i.MX 95, German manufacturers focus more on safety-critical and automotive applications than on universal AI cores. Infineon emphasizes functional safety, reliability, and ISO 26262 compliance rather than pure TOPS performance. Bosch integrates AI into application-specific ASICs (radar, camera), achieving high energy efficiency but with limited flexibility. Siemens operates more as a tool provider. The solutions of these manufacturers excel in robustness, safety integration, and automotive qualification.
A radical change is also taking place at the SoC level. Edge-capable platforms like NVIDIA Jetson Orin, Qualcomm QCS8550, or Ambarella CV3 merge CPU, GPU, NPU, and often ISP and security functions into a single, highly integrated system. The challenge lies in balancing computational performance, thermal budget, and deterministic behavior. In safety-critical areas such as automotive, medical technology, or industrial automation, these characteristics determine acceptance and approval.
NVIDIA Jetson Nano, a developer kit for edge AI applications in robotics and embedded systems.
(Image: NVIDIA)
SoCs from leading German providers are highly domain-specific. Infineon integrates inference-capable DSP blocks into the AURIX-TC4x series, combining safety MCU logic with neural signal processing for automotive and industrial applications. Bosch develops ASICs with embedded neural networks for radar and camera sensors and is researching self-learning MEMS. Siemens, through Mentor/Catapult AI, focuses on toolchains but also supports partners in developing edge AI ASICs. Overall, the focus in Germany is more on safety-certified, energy-efficient AI-on-chip approaches for specialized applications, rather than on universal NPUs.
In research labs, a new generation of architectures for "AI on the chip" is already taking shape.
Between Stuttgart, Dresden, and Bochum
Q.ANT (Stuttgart), a TRUMPF subsidiary, is taking a photonics-based approach. Its photonic accelerator uses optical interference to perform neural computations with extremely high energy efficiency – so far as a proof of concept in a laboratory setting; industrial mass deployment is still in its early stages. The goal is not an MCU but a high-performance edge module for real-time AI in industrial sensing and robotics.
Against this backdrop, the trend toward analog AI chips for the intelligent edge continues. This class of chips computes neural networks directly in the analog domain. The technology is still predominantly in the laboratory and early prototype stage.
Semron (Dresden) is working on a 3D compute-in-memory architecture called CapRAM, in which computations take place directly in memory. The technology combines electrical and capacitive effects into a chip design that could represent neural networks in analog form at a tiny scale – with target values of several hundred million parameters per square millimeter.
Gemesys from Bochum follows the principle of analog learning with a biologically inspired architecture: a neuromorphic chip that independently adjusts its "synapses" through current flows instead of calculating digital weights.
Analog Learning in Hardware with Neuromorphic Chips from Gemesys
While today's AI cores accelerate digital inference, startups like Gemesys are already working on the next stage: analog, neuromorphic processors that perform training and learning physically in silicon. These approaches are still laboratory technologies, but they indicate where embedded development is headed in the long term – toward systems that not only compute but truly learn.
The startup Gemesys is developing an entirely new class of AI chips: analog-neuromorphic processing units for bio-inspired learning.
Instead of digitally simulating neural networks, Gemesys transfers the principles of biological information processing directly onto silicon using novel circuit elements, so-called memristors. The architecture emulates synaptic plasticity with the goal of enabling continuous machine learning directly on the chip, without the energy-intensive matrix multiplications of a GPU cluster and without the round trip to the cloud.
"The special feature of a memristor," explains Dr.-Ing. Dennis Michaelis, one of the three co-founders, "is its ability not only to 'store the resistance value' but also 'the state it was last in.' This enables 'processing and storing information in the same physical location': in-memory computing."
While classical AI accelerators (e.g., GPUs, NPUs) compute neural networks numerically, Gemesys leverages the conductivity of analog circuits to replicate synaptic weightings. The chip is designed to operate at low clock rates of a few MHz and with a fraction of the energy of digital accelerators.
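The underlying physics can be modeled in a few lines: with conductances as weights and voltages as inputs, each row current of a crossbar is a dot product by Ohm's and Kirchhoff's laws. The C sketch below only simulates this behavior; dimensions and names are illustrative, not Gemesys' actual design.

```c
#define ROWS 4
#define COLS 8

/* Software model of an analog crossbar: conductances G encode weights,
 * column voltages V are the inputs, and each row current is
 * I[i] = sum_j G[i][j] * V[j]. In silicon the currents simply add up
 * on the row wire; the multiply-accumulate happens "for free". */
void crossbar_mvm(const double G[ROWS][COLS], const double V[COLS],
                  double I[ROWS])
{
    for (int i = 0; i < ROWS; i++) {
        I[i] = 0.0;
        for (int j = 0; j < COLS; j++)
            I[i] += G[i][j] * V[j];
    }
}
```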
Gemesys' compute-in-memory architecture eliminates fetch cycles, reduces latency, and cuts energy consumption by up to two orders of magnitude. Early lab prototypes are expected to perform learning tasks with less than 10 mW – tasks for which conventional NPUs would consume several watts.
The chip can not only infer but also train independently—a true breakthrough compared to NPUs. The data does not need to leave the chip, reducing bandwidth, latency, and security risks. Currently, the approach is not aimed at classic embedded MCUs but at edge intelligence in sensing, robotics, and IoT devices, where continuous on-site learning plays a critical role—for example, in adaptive motor controls, self-calibrating systems, or acoustic signal recognition.
The basic element is a kind of "programmable synapse" based on memristive or transistor-like analog components. These store the learning weights continuously rather than in quantized form and adjust them through current flow, much as biological synapses adapt their strength in response to signals. Training relies on on-chip adaptability through analog feedback.
Storage occurs directly within the component; no RAM, no fetch cycle, no quantization is required. Resistive or capacitive analog memory is used as the storage medium.
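A hedged model of such an analog update step is sketched below, with an invented delta-rule-style learning signal and invented device limits; real memristors additionally show nonlinearity and drift that calibration logic must compensate.

```c
#define G_MIN 1e-6   /* minimum programmable conductance (S), illustrative */
#define G_MAX 1e-3   /* maximum programmable conductance (S), illustrative */

static double clampg(double g)
{
    if (g < G_MIN) return G_MIN;
    if (g > G_MAX) return G_MAX;
    return g;
}

/* One analog update step: error * input sets the polarity and magnitude
 * of the programming pulse that nudges the conductance. The new weight
 * is stored in the device itself, with no RAM write-back and no
 * quantization. */
void synapse_update(double *g, double input, double error, double rate)
{
    *g = clampg(*g + rate * error * input);
}
```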
"Memristors have the potential to realize neural networks with billions of parameters on the area of a fingernail," says Dr.-Ing. Dennis Michaelis, co-founder and managing director of GEMESYS GmbH from Bochum.
(Image: GEMESYS)
As part of the EMULAITE project, Gemesys is developing a groundbreaking approach to on-chip AI training in cooperation with the Chair of Production Systems (LPS) at Ruhr University Bochum.
The neural network is implemented as an analog circuit based on energy-efficient memristor technology—completely without conventional matrix multiplications or frequent memory accesses. Memristors support synaptic weight adjustments as direct physical processes within the circuit: learning occurs through current flows and resistance changes.
This technology leads to a drastic reduction in energy consumption for AI training: estimates suggest that EMULAITE's method is 100 to 1,000 times more energy-efficient than conventional gradient-based approaches. This marks a milestone for sustainable and resource-efficient edge AI training. For the first time, battery-powered devices and embedded systems can learn locally on the chip without relying on data-intensive cloud solutions.
The project opens up new possibilities for the use of artificial intelligence in energy-constrained environments and is funded under the Green Startups Initiative.NRW.
Gemesys is still at the beginning of commercialization. Analog systems require precise calibration and compensating algorithms to offset temperature and drift effects. Nevertheless, the concept is considered promising: instead of simulating neural networks, they are implemented physically.
Related concepts are being explored by Innatera Nanosystems from the Netherlands and California-based BrainChip. However, with its analog neuromorphic chip, Gemesys operates in a new dimension of AI processing: at the intersection of data-driven AI and materially anchored intelligence.
Conclusion
The latest generation of microcontrollers and SoCs embeds artificial intelligence directly into the hardware. Where software libraries previously ran on external processors, specialized AI cores now handle complete inference—directly on the chip, with minimal latency and remarkable efficiency.
The latest innovations highlight how the embedded world is realigning: AI is becoming a hardware function, as natural as a timer or ADC. It is changing the way developers write software, test systems, and implement security concepts. Reactive controls are transforming into adaptive systems, and fixed processes into learning workflows. The classic microcontroller, long a symbol of deterministic simplicity, is evolving into a powerhouse of edge intelligence. (mbf)
*Anna Kobylinska and Filipe Pereira Martins work for McKinley Denali, Inc., USA.