World-Action Versus Vision-Language-Action Model AI: Which Language Model Will Prevail for Automated Driving?

By Henrik Bork | Translated by AI | 5 min reading time


In China's automotive industry, there is currently debate about the right technological path to automated driving. At professional conferences, in industry media, and on social networks, two AI models clash: the World-Action Model and the Vision-Language-Action Model.

VLA or WA – that is the question. In China's automotive industry, there is currently debate about the right technological path to automated driving. (Image: © Choi_ Nikolai - stock.adobe.com)

Top executives at leading Chinese companies are taking public positions and are not shying away from sharp words. Recently, He Xiaopeng, founder and CEO of the electric car manufacturer Xpeng, openly questioned the competition: he said he knew of no Chinese manufacturer that had developed a genuine Vision-Language-Action model (VLA) "rather than just a deformed version." According to the characteristically self-confident He, Xpeng is, to the best of his knowledge, the only company in China to have achieved this. Although He named no names, it was clear that he was primarily referring to rival Li Auto, which had previously announced the production readiness of its own VLA system.

At the end of August, Huawei also joined the debate. The technology company, which has developed into an influential supplier of driver assistance and autonomous systems in recent years, remains unwavering in its commitment to the World-Action Model (WA). Jin Yuzhi, head of Huawei’s Intelligent Automotive division, made it clear that his company would not follow the VLA trend. "Huawei will not take the VLA path. Huawei places more emphasis on WA, or World Action, which skips the language step," Jin emphasized in an interview. VLA attempts to convert video data into "linguistic tokens" using advanced language model technology and derive vehicle control commands from them. While this approach may seem clever and has helped some automakers make quick progress in assistance functions, Jin argued that it is not the key to true autonomy.

Huawei instead relies on a direct end-to-end model, in which sensor data – whether visual impressions, sounds, or other signals – is converted directly into driving actions, without a detour through language processing. While this approach may seem particularly demanding at first glance, Jin Yuzhi is confident that it is the only way to enable fully autonomous driving.
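The contrast Jin describes can be made concrete with a toy sketch. The following pipeline maps a fused sensor vector straight to a control command with no intermediate language representation; all function names, channel meanings, and numbers are illustrative placeholders, not Huawei's actual ADS internals:

```python
# Toy world-action (WA) style policy: fused sensors -> action, no language layer.
# Everything here is a hand-written stand-in for what would be a learned network.

def fuse_sensors(camera: list[float], lidar: list[float]) -> list[float]:
    """Concatenate raw sensor channels into one feature vector."""
    return camera + lidar

def wa_policy(features: list[float]) -> dict[str, float]:
    """Map fused features directly to control outputs."""
    obstacle_ahead = features[0] > 0.5  # pretend channel 0 encodes proximity
    return {
        "steering": -0.3 if obstacle_ahead else 0.0,  # evade to the left
        "throttle": 0.0 if obstacle_ahead else 0.4,
    }

action = wa_policy(fuse_sensors(camera=[0.9, 0.1], lidar=[0.2]))
```

The point of the sketch is the data flow, not the logic: raw perception goes in, actuator values come out, and at no step does the system describe the scene in words.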

Advantages of Language Models

The idea behind VLA is to use a large language model (LLM) for driving automation: camera images and other sensor data are translated into descriptive language, which an AI system analyzes in order to derive the corresponding driving decisions. Several Chinese automakers, led by Xpeng and Li Auto, have made significant progress with this approach in recent months.
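The VLA data flow described above can be sketched in a few lines. The functions below are deliberately simplified stand-ins (real systems use learned vision encoders and large language models, not hand-written rules), but they show the extra language stage that distinguishes VLA from a direct end-to-end model:

```python
# Toy VLA-style pipeline: sensors -> language description -> decision -> action.
# All names and rules are illustrative, not any vendor's actual implementation.

def describe_scene(camera: list[float]) -> str:
    """Stand-in for a vision encoder that emits a linguistic scene description."""
    return "pedestrian ahead" if camera[0] > 0.5 else "road clear"

def llm_decide(description: str) -> str:
    """Stand-in for an LLM reasoning over the description in language space."""
    return "brake" if "pedestrian" in description else "cruise"

def to_control(decision: str) -> dict[str, float]:
    """Translate the language-level decision into actuator commands."""
    if decision == "brake":
        return {"throttle": 0.0, "brake": 1.0}
    return {"throttle": 0.4, "brake": 0.0}

command = to_control(llm_decide(describe_scene(camera=[0.9, 0.1])))
```

The intermediate string is exactly the "language layer" that Huawei's Jin Yuzhi argues costs spatial precision, and that VLA proponents argue enables richer reasoning.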

Li Auto has integrated an initial "MindVLA" function into its production vehicles, and Xpeng announced that its new P7 model will receive a VLA-based system via a software update this fall. Observers speak of a potential "shortcut" to highly advanced driver assistance systems: by leveraging existing large language models and massive datasets, these companies were reportedly able to enhance their autonomous driving functions significantly in a short period of time.

Xpeng, for example, developed its own base model with 72 billion parameters, which is compressed through distillation so that it can run in its vehicles. Li Auto, on the other hand, pursues a hybrid approach: a small VLA model component operates in the vehicle, while a large "world model AI" simulates scenarios in the data center and continuously improves the system.
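Distillation, mentioned above, means training a small "student" network to reproduce the output distribution of a large "teacher". A minimal numerical illustration of the core loss (pure Python with toy numbers; Xpeng's actual training procedure is not public):

```python
# Toy knowledge-distillation loss: the student is pushed to match the
# teacher's softened probability distribution over possible actions.
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Convert raw scores into probabilities; higher temperature flattens them."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q): how far the student distribution q is from the teacher p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher_logits = [2.0, 1.0, 0.1]  # large cloud model's raw action scores
student_logits = [1.8, 1.1, 0.2]  # small in-car model's raw action scores

# Softened targets (temperature > 1) expose more of the teacher's preferences
# than hard labels would; the trainer would minimize this loss.
loss = kl_divergence(softmax(teacher_logits, 2.0), softmax(student_logits, 2.0))
```

In practice the loss is minimized by gradient descent over millions of examples; the sketch only shows what quantity shrinks as the 72-billion-parameter teacher's behavior is squeezed into a vehicle-sized student.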

Critics like He Xiaopeng argue that Li Auto has merely "patched together" VLA and is using the buzzword without having a fully functional model on board. Huawei, on the other hand, strategically and unwaveringly pursues the traditional, sensor-based approach. The company's proprietary Autonomous Driving Solution (ADS) system is already integrated into over one million vehicles, which together have completed more than four billion kilometers of assisted driving.

Based on the WA principle, Huawei has further refined this approach and developed the World Engine, World Action (WEWA) architecture. It is used in the new ADS 4.0 platform and aims to enable highly precise autonomous driving through direct sensory world modeling. Huawei emphasizes that WA, without the intermediate language step, offers particular advantages in spatial perception – exactly the area where VLA shows weaknesses due to its abstract language layer. Additionally, Huawei relies heavily on extensive sensor technology, such as multiple lidar units per vehicle, and high-performance hardware to provide the WA model with environmental information that is as complete as possible in real time.

New Business Models

Huawei is willing to accept the initially higher costs, since safety reserves and robust performance over the entire vehicle lifecycle take priority, according to Jin Yuzhi. He is critical of the fact that some competitors initially offer their driver assistance free of charge: "Nothing in the world is free." Such offers, he concluded harshly, are either time-limited, cross-subsidized through the vehicle price, or simply underdeveloped, effectively using drivers as test pilots.


Huawei pursues a different business model: through continuous over-the-air updates during the usage period, the systems are meant to keep learning – a service the customer pays for, but one that, according to the company, ultimately delivers greater safety and utility in the long run.

No Absolute Truths Yet

This occasionally heated controversy over "VLA versus WA" also has a cultural dimension. Advocates of the new VLA approach hail it as a technological breakthrough. Zhou Guang, head of the startup Yuanrong Qixing, confidently stated that the performance floor of VLA models has now surpassed the ceiling of traditional end-to-end systems, thanks in part to features such as built-in inference chains and complex language understanding modules that characterize VLA.

Industry veterans, however, view the excitement quite calmly. A senior engineer from Horizon Robotics commented that, at their core, all current solutions, whether VLM extension, VLA, or Huawei’s world model, are merely different variations of the end-to-end learning approach.

One should not overestimate the new buzzwords. In fact, the entire industry is in an early "trial-and-error" phase, where different concepts are being tested. Absolute truths do not yet exist.

What Are the Implications of a Competition of Approaches?

Some experts even consider hybrid models conceivable, combining elements of both worlds. What is certain is that China’s automakers are at a crossroads. While companies like Xpeng and Li Auto are moving aggressively with VLA-supported AI, Huawei relies on its data-driven WA concept and years of investment in hardware.

The competition between approaches could shape the development of automated and autonomous driving technically, economically, and strategically. Whether one of the two paths proves clearly superior, or a combination ultimately wins out, remains to be seen. (se)