Embedded AI With Real-Time Applications
Optimize Edge AI With NPUs and Model Compression

By Christoph Stockhammer*


By cleverly combining NPUs, careful AI model design, and strategic compression techniques, embedded devices can be turned into efficient, high-performance decision-makers that are ready to act in real time.

An NPU is an economical, energy-efficient solution designed for efficient AI inference and neural network calculations in embedded systems. (Image: Yuichiro Chino / MathWorks)

More and more devices are becoming intelligent. From smartphones and connected sensors to autonomous driving, edge AI is increasingly coming into focus. Edge AI refers to artificial intelligence that makes decisions directly on the device, without routing through the cloud. However, powerful models require significant computing power, memory, and energy resources. At the same time, the need for real-time or near-real-time decisions is growing, further driving demand for high-performance edge solutions. This is precisely where Neural Processing Units (NPUs) come into play: they are specifically designed to run complex AI models with low latency and minimal energy consumption, opening up new practical possibilities. One of the biggest challenges, however, remains minimizing inference time—the time a model needs to make a prediction. Especially in motor control, inference time often needs to be under 10 milliseconds to ensure system stability and responsiveness, and to prevent mechanical stress or damage.

NPUs are specifically designed for AI inference and computations in neural networks. This makes them particularly suitable for embedded systems where computing power is limited and energy efficiency is critical. Unlike CPUs, which serve as general-purpose processors, or GPUs, which are powerful but energy-intensive, NPUs are optimized for the efficient computation of matrix operations, the core of neural networks. While GPUs can also be used for AI inference, NPUs stand out with significantly lower energy consumption and reduced costs.

From an economic perspective, NPUs present an attractive alternative to microcontrollers (MCUs), GPUs, or FPGAs for AI tasks. While chips with integrated NPUs are more expensive to purchase than basic microcontrollers, their added value lies in superior energy efficiency and AI performance. These features reduce operating costs over the long term, extend battery life, and open up new applications for embedded systems. Additionally, NPUs enable real-time AI processing without relying on costly and power-hungry alternatives like GPUs or FPGAs.

Lean AI Models for the Edge: Projection and Quantization in Practice

At the same time, NPUs have their limits: memory and energy budgets are tight. Model compression is therefore essential to bring model size and complexity down to a level that permits real-time performance.

Compression techniques reduce size and complexity, improve inference speed, and lower energy consumption, which is what makes large AI models deployable at the edge in the first place. Excessive compression, however, can impair prediction quality, so engineers must carefully weigh how much accuracy they are willing to trade against the hardware constraints.

Two complementary compression techniques have proven particularly effective: projection and quantization. By combining these methods, AI models can be specifically optimized for NPUs. Projection reduces model size by eliminating redundant parameters, while quantization further compresses the model by converting remaining parameters into (typically integer) data types with lower memory requirements. Together, these approaches enable compression at both the structural and data type levels, enhancing efficiency without significantly compromising accuracy.
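A rough back-of-the-envelope calculation illustrates the combined effect. The layer dimensions below are hypothetical and not taken from a real model:

```matlab
% Hypothetical example: combined savings for a single fully connected layer
% whose 256x256 weight matrix is projected onto a rank-32 subspace and whose
% remaining parameters are then stored as 8-bit integers instead of 32-bit floats.
paramsOriginal  = 256 * 256;            % 65,536 learnable weights
paramsProjected = 256*32 + 32*256;      % two rank-32 factors: 16,384 weights
bytesFloat32    = paramsOriginal  * 4;  % 4 bytes per float32 weight
bytesInt8       = paramsProjected * 1;  % 1 byte per int8 weight after both steps
fprintf("Memory reduced by a factor of %.0f\n", bytesFloat32 / bytesInt8);  % prints 16
```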

The projection of neural networks is a structural compression technique available in the MATLAB Deep Learning Toolbox. It reduces the number of learnable parameters in a model by projecting the weight matrices of the layers onto low-dimensional subspaces.

Using principal component analysis (PCA), it identifies the directions along which the neurons' activations vary most and removes parameters that contribute little to them. This reduces memory and computational requirements while largely preserving model accuracy.
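With the Deep Learning Toolbox, this step might look roughly like the following sketch. It assumes a trained dlnetwork named net and a minibatchqueue named mbq that delivers representative training data; both names are placeholders:

```matlab
% Sketch of structural compression by projection (placeholder variable names).
npca = neuronPCA(net, mbq);                        % PCA analysis of the neuron activations
netProjected = compressNetworkUsingProjection( ...
    net, npca, ExplainedVarianceGoal=0.95);        % keep directions explaining 95% of the variance
% A short fine-tuning run on the training data typically recovers most of the
% accuracy lost through projection.
```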

After projection comes quantization, a compression technique at the data type level: it reduces the memory requirements and computational complexity of AI models by converting the learnable parameters (weights and biases) from high-precision floating-point values to low-precision integer types. This lowers memory usage and speeds up inference, especially on NPUs. Quantization introduces some loss of numerical precision, but calibrating the model with representative input data generally keeps accuracy within acceptable limits for real-time applications.
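A minimal sketch of the corresponding quantization step, assuming the projected network netProjected from above and a datastore calData with representative calibration inputs (the exact options depend on the deployment target):

```matlab
% Sketch of int8 quantization with calibration (placeholder variable names).
quantObj     = dlquantizer(netProjected, ExecutionEnvironment="CPU");
calResults   = calibrate(quantObj, calData);   % record dynamic ranges of weights and activations
netQuantized = quantize(quantObj);             % network with int8 learnables for deployment
```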


Implementation of Projection and Quantization Techniques at STMicroelectronics

A practical example is provided by STMicroelectronics, a global manufacturer of semiconductors and microelectronics that develops chips for cars, smartphones, industry, and IoT devices. Engineers created a workflow using MATLAB® and Simulink® to deploy deep learning models on STM32 microcontrollers. The process began with designing and training the model, followed by hyperparameter optimization and knowledge distillation to reduce model complexity.

In the next step, they applied projection to structurally compress the model and remove redundant parameters, followed by quantization, which converted weights and activations into 8-bit integers. The result was a significantly smaller memory footprint and faster execution. This two-step compression approach enables deep learning models to run on resource-constrained NPUs and MCUs without compromising real-time performance.
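The memory saving comes from the way each weight is mapped to an 8-bit integer. The following sketch shows the basic idea of an affine int8 mapping; it is purely illustrative and not the toolbox's internal scheme:

```matlab
% Illustrative affine int8 quantization of a small weight vector.
w      = single([-0.42 0.07 0.31 -0.15 0.25]);   % example float32 weights, 4 bytes each
scale  = (max(w) - min(w)) / 255;                % map the observed range onto 256 integer steps
zeroPt = -128 - round(min(w) / scale);           % offset so that min(w) maps to -128
wInt8  = int8(round(w / scale) + zeroPt);        % stored on the device: 1 byte per weight
wDeq   = (single(wInt8) - zeroPt) * scale;       % values reconstructed at inference time
maxErr = max(abs(w - wDeq));                     % rounding error is at most about scale/2
```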

Best Practices for Deploying AI Models on NPUs

Comparison of accuracy, model size, and inference speed of a recurrent neural network with an LSTM layer for modeling the state of charge of a battery, before and after projection with fine-tuning. (Image: MathWorks)

Model compression techniques such as projection and quantization can significantly improve the performance and applicability of AI models on NPUs. However, since compression can affect accuracy, iterative testing—both in simulation and with processor-in-the-loop validation—is essential to ensure that the models meet functional and resource-related requirements.

Testing early and often lets engineers identify and resolve issues before they propagate, reducing the risk of rework in later development stages and supporting seamless deployment in embedded systems.
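Even a simple host-side measurement can catch gross latency problems before processor-in-the-loop runs, for example against the 10-millisecond budget mentioned at the beginning. The network name and input size below are placeholders; the timing that ultimately counts is the one measured on the target:

```matlab
% Host-side smoke test of inference latency (placeholder names; target timing
% must still be verified processor-in-the-loop).
x      = dlarray(single(rand(16, 1)), "CB");     % one dummy input with 16 features
tInfer = timeit(@() predict(netQuantized, x));   % typical wall-clock time of one prediction
assert(tInfer < 0.010, "Inference exceeds the 10 ms real-time budget");
```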

A unified ecosystem can also address many challenges in deploying AI models by simplifying integration, accelerating development, and supporting comprehensive testing throughout the entire process. This is particularly valuable in today's fragmented software landscape, where engineers often need to integrate different codebases into their simulation workflows or larger system environments. The integration of NPUs further increases the complexity of the toolchain—another reason for the necessity of a unified ecosystem.

With the MATLAB® Deep Learning Toolbox, engineers can design, simulate, and optimize compressed AI models. This enables them to meet application-specific requirements for speed, accuracy, and efficiency on NPU hardware. At the same time, the future of embedded AI lies in powerful, edge-optimized hardware architectures that control complex technical systems. Success depends on the right balance between model compression, early hardware testing, and adaptable systems. (sg)

* Christoph Stockhammer is Principal Application Engineer at MathWorks.