Embedded AI With Real-Time Applications
Optimize Edge AI With NPUs and Model Compression

By Christoph Stockhammer*


By cleverly combining NPUs, careful AI model design, and strategic compression techniques, embedded devices can be turned into efficient, high-performance decision-makers that are ready to act in real time.

An NPU is an economical, energy-efficient solution designed for efficient AI inference and neural network calculations in embedded systems. (Image: Yuichiro Chino / MathWorks)

More and more devices are becoming intelligent. From smartphones and connected sensors to autonomous driving, edge AI is increasingly coming into focus. Edge AI refers to artificial intelligence that makes decisions directly on the device, without routing through the cloud. However, powerful models require significant computing power, memory, and energy resources. At the same time, the need for real-time or near-real-time decisions is growing, further driving demand for high-performance edge solutions. This is precisely where Neural Processing Units (NPUs) come into play: they are specifically designed to run complex AI models with low latency and minimal energy consumption, opening up new practical possibilities. One of the biggest challenges, however, remains minimizing inference time—the time a model needs to make a prediction. Especially in motor control, inference time often needs to be under 10 milliseconds to ensure system stability and responsiveness, and to prevent mechanical stress or damage.

NPUs are specifically designed for AI inference and computations in neural networks. This makes them particularly suitable for embedded systems where computing power is limited and energy efficiency is critical. Unlike CPUs, which serve as general-purpose processors, or GPUs, which are powerful but energy-intensive, NPUs are optimized for the efficient computation of matrix operations, the core of neural networks. While GPUs can also be used for AI inference, NPUs stand out with significantly lower energy consumption and reduced costs.

From an economic perspective, NPUs present an attractive alternative to microcontrollers (MCUs), GPUs, or FPGAs for AI tasks. While chips with integrated NPUs are more expensive to purchase than basic microcontrollers, their added value lies in superior energy efficiency and AI performance. These features reduce operating costs over the long term, extend battery life, and open up new applications for embedded systems. Additionally, NPUs enable real-time AI processing without relying on costly and power-hungry alternatives like GPUs or FPGAs.

Lean AI Models for the Edge: Projection and Quantization in Practice

At the same time, NPUs have their limits: memory and energy budgets are tight. Model compression is therefore essential to bring model size and complexity down to a level that permits real-time performance.

Compression techniques reduce size and complexity, improve inference speed, and lower energy consumption, which is what makes large AI models deployable at the edge in the first place. Excessive compression, however, can impair prediction quality, so engineers must carefully weigh how much accuracy they are willing to trade against the hardware constraints.

Two complementary compression techniques have proven particularly effective: projection and quantization. By combining these methods, AI models can be specifically optimized for NPUs. Projection reduces model size by eliminating redundant parameters, while quantization further compresses the model by converting remaining parameters into (typically integer) data types with lower memory requirements. Together, these approaches enable compression at both the structural and data type levels, enhancing efficiency without significantly compromising accuracy.
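A rough back-of-the-envelope calculation illustrates the combined effect. The layer dimensions below are hypothetical and not taken from a real model:

```matlab
% Hypothetical example: combined savings for a single fully connected layer
% whose 256x256 weight matrix is projected onto a rank-32 subspace and whose
% remaining parameters are then stored as 8-bit integers instead of 32-bit floats.
paramsOriginal  = 256 * 256;            % 65,536 learnable weights
paramsProjected = 256*32 + 32*256;      % two rank-32 factors: 16,384 weights
bytesFloat32    = paramsOriginal  * 4;  % 4 bytes per float32 weight
bytesInt8       = paramsProjected * 1;  % 1 byte per int8 weight after both steps
fprintf("Memory reduced by a factor of %.0f\n", bytesFloat32 / bytesInt8);  % prints 16
```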

The projection of neural networks is a structural compression technique available in the MATLAB Deep Learning Toolbox. It reduces the number of learnable parameters in a model by projecting the weight matrices of the layers onto low-dimensional subspaces.

Using principal component analysis (PCA), it identifies the directions along which the neurons' activations vary most and removes parameters that contribute little to them. This reduces memory and computational requirements while largely preserving model accuracy.
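With the Deep Learning Toolbox, this step might look roughly like the following sketch. It assumes a trained dlnetwork named net and a minibatchqueue named mbq that delivers representative training data; both names are placeholders:

```matlab
% Sketch of structural compression by projection (placeholder variable names).
npca = neuronPCA(net, mbq);                        % PCA analysis of the neuron activations
netProjected = compressNetworkUsingProjection( ...
    net, npca, ExplainedVarianceGoal=0.95);        % keep directions explaining 95% of the variance
% A short fine-tuning run on the training data typically recovers most of the
% accuracy lost through projection.
```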

After projection comes quantization, a compression technique at the data type level: it reduces the memory requirements and computational complexity of AI models by converting the learnable parameters (weights and biases) from high-precision floating-point values to low-precision integer types. This lowers memory usage and speeds up inference, especially on NPUs. Quantization introduces some loss of numerical precision, but calibrating the model with representative input data generally keeps accuracy within acceptable limits for real-time applications.
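A minimal sketch of the corresponding quantization step, assuming the projected network netProjected from above and a datastore calData with representative calibration inputs (the exact options depend on the deployment target):

```matlab
% Sketch of int8 quantization with calibration (placeholder variable names).
quantObj     = dlquantizer(netProjected, ExecutionEnvironment="CPU");
calResults   = calibrate(quantObj, calData);   % record dynamic ranges of weights and activations
netQuantized = quantize(quantObj);             % network with int8 learnables for deployment
```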


Implementation of Projection and Quantization Techniques at STMicroelectronics

A practical example is provided by STMicroelectronics, a global manufacturer of semiconductors and microelectronics that develops chips for cars, smartphones, industry, and IoT devices. Engineers created a workflow using MATLAB® and Simulink® to deploy deep learning models on STM32 microcontrollers. The process began with designing and training the model, followed by hyperparameter optimization and knowledge distillation to reduce model complexity.

In the next step, they applied projection to structurally compress the model and remove redundant parameters, followed by quantization, which converted weights and activations into 8-bit integers. The result was a significantly smaller memory footprint and faster execution. This two-step compression approach enables deep learning models to run on resource-constrained NPUs and MCUs without compromising real-time performance.
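The memory saving comes from the way each weight is mapped to an 8-bit integer. The following sketch shows the basic idea of an affine int8 mapping; it is purely illustrative and not the toolbox's internal scheme:

```matlab
% Illustrative affine int8 quantization of a small weight vector.
w      = single([-0.42 0.07 0.31 -0.15 0.25]);   % example float32 weights, 4 bytes each
scale  = (max(w) - min(w)) / 255;                % map the observed range onto 256 integer steps
zeroPt = -128 - round(min(w) / scale);           % offset so that min(w) maps to -128
wInt8  = int8(round(w / scale) + zeroPt);        % stored on the device: 1 byte per weight
wDeq   = (single(wInt8) - zeroPt) * scale;       % values reconstructed at inference time
maxErr = max(abs(w - wDeq));                     % rounding error is at most about scale/2
```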

Best Practices for Deploying AI Models on NPUs

Comparison of accuracy, model size, and inference speed of a recurrent neural network with an LSTM layer for modeling the state of charge of a battery, before and after projection with fine-tuning. (Image: MathWorks)

Model compression techniques such as projection and quantization can significantly improve the performance and applicability of AI models on NPUs. However, since compression can affect accuracy, iterative testing—both in simulation and with processor-in-the-loop validation—is essential to ensure that the models meet functional and resource-related requirements.

Testing early and often lets engineers identify and resolve issues before they propagate, reducing the risk of rework in later development stages and supporting seamless deployment in embedded systems.
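Even a simple host-side measurement can catch gross latency problems before processor-in-the-loop runs, for example against the 10-millisecond budget mentioned at the beginning. The network name and input size below are placeholders; the timing that ultimately counts is the one measured on the target:

```matlab
% Host-side smoke test of inference latency (placeholder names; target timing
% must still be verified processor-in-the-loop).
x      = dlarray(single(rand(16, 1)), "CB");     % one dummy input with 16 features
tInfer = timeit(@() predict(netQuantized, x));   % typical wall-clock time of one prediction
assert(tInfer < 0.010, "Inference exceeds the 10 ms real-time budget");
```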

A unified ecosystem can also address many challenges in deploying AI models by simplifying integration, accelerating development, and supporting comprehensive testing throughout the entire process. This is particularly valuable in today's fragmented software landscape, where engineers often need to integrate different codebases into their simulation workflows or larger system environments. The integration of NPUs further increases the complexity of the toolchain—another reason for the necessity of a unified ecosystem.

With the MATLAB® Deep Learning Toolbox, engineers can design, simulate, and optimize compressed AI models. This enables them to meet application-specific requirements for speed, accuracy, and efficiency on NPU hardware. At the same time, the future of embedded AI lies in powerful, edge-optimized hardware architectures that control complex technical systems. Success depends on the right balance between model compression, early hardware testing, and adaptable systems. (sg)

* Christoph Stockhammer is Principal Application Engineer at MathWorks.