Alibaba's Qwen3-Omni model outperforms GPT-4o and Gemini in key AI tests

By Canuto

Alibaba takes on US technology giants. Alibaba presents Qwen3-Omni, an open AI model capable of integrating text, image, audio and video.

***

Qwen3-Omni comes in three specialized versions for different business uses.
The model outperforms GPT-4o and Gemini 2.5 Pro on multiple benchmarks.
Alibaba bets on open source, with an Apache 2.0 license for companies and developers.

Progress in artificial intelligence continues to accelerate, and it now comes strongly from China. Alibaba, the e-commerce and cloud computing giant, presented Qwen3-Omni, an open language model that accepts text, images, audio and video as inputs. With this launch, the company challenges the technology leaders of the United States, including OpenAI and Google, by offering advanced multimodal capabilities under the Apache 2.0 license, which allows commercial use without license costs.

Qwen3-Omni arrives at a time when Nvidia has announced an investment of USD 100 billion in OpenAI data centers, intensifying competition in the sector. While the large American technology companies concentrate their efforts on proprietary models, Alibaba is committed to opening its technology, allowing companies and developers to download, modify and deploy its model for free.

An “omni” model that integrates everything

Alibaba’s proposal stands out for integrating text, image, audio and video into a single system from the ground up, avoiding the fragmentation that characterized previous models. Unlike GPT-4o, which unified text, image and audio, and Gemini 2.5 Pro, which also analyzes video but is closed, Qwen3-Omni is fully open. In addition, it outperforms Gemma 3n, Google's closest open-source alternative, on several metrics.

The model can receive multimodal data and respond in text or audio. This dual output capability makes it well suited to business and customer service applications that require real-time interaction. Alibaba Cloud already offers Qwen3-Omni on Hugging Face, on GitHub and through its own API, with a faster version called “Flash”.

Three versions for different needs

Alibaba has launched three variants of Qwen3-Omni to cover different usage scenarios. The “Instruct” version combines the Thinker and Talker components, accepting audio, video and text inputs and producing text and speech outputs. The “Thinking” version focuses on reasoning tasks and long chain-of-thought processing, accepting the same inputs but limiting its output to text. Finally, the “Captioner” version is optimized to caption audio accurately and with a low hallucination rate.

This segmentation lets developers choose between broad multimodal interaction, deep reasoning or specialized audio understanding, according to their objectives. The model supports 119 languages in text, 19 for speech input and 10 for speech output, including dialects such as Cantonese.
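As an illustration of how a developer might choose among the three variants, here is a minimal sketch in Python. The modality sets are read from the description above (the Captioner entry assumes audio in, text out), and the names `Variant`, `VARIANTS` and `candidates` are hypothetical helpers, not part of any Alibaba API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    inputs: frozenset   # modalities the variant accepts, per the article
    outputs: frozenset  # modalities the variant can produce, per the article


# Assumption: modality sets transcribed from the article's description only.
VARIANTS = [
    Variant("Instruct", frozenset({"text", "audio", "video"}), frozenset({"text", "speech"})),
    Variant("Thinking", frozenset({"text", "audio", "video"}), frozenset({"text"})),
    Variant("Captioner", frozenset({"audio"}), frozenset({"text"})),
]


def candidates(needed_inputs, needed_outputs):
    """Return the names of variants that cover every requested modality."""
    return [
        v.name
        for v in VARIANTS
        if set(needed_inputs) <= v.inputs and set(needed_outputs) <= v.outputs
    ]


if __name__ == "__main__":
    # A voice assistant needs audio in and speech out.
    print(candidates({"audio"}, {"speech"}))
    # A video analysis job only needs text out.
    print(candidates({"video"}, {"text"}))
```

Under these assumptions, only “Instruct” qualifies when speech output is required, while a text-only video analysis task could use either “Instruct” or “Thinking”.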

Technical design and performance

Qwen3-Omni adopts a Thinker-Talker architecture. The Thinker component handles multimodal reasoning and understanding, while Talker generates natural-sounding speech. The architecture is built on a Mixture-of-Experts (MoE) design that improves concurrency and inference speed. Talker conditions directly on audio and visual features, achieving more natural prosody and timbre in translation and dialogue.

The model records theoretical latencies of 0.234 seconds for audio and 0.547 seconds for video, staying below a real-time factor of 1 even with multiple concurrent requests. Its audio encoder, Audio Transformer (AuT), was trained on 20 million hours of supervised data, with 80% in Chinese and English and the rest covering other languages and audio understanding tasks.
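For readers unfamiliar with the term, the real-time factor (RTF) is commonly defined as processing time divided by the duration of the audio involved, so a value below 1 means the system works faster than playback. The sketch below, in Python, shows that calculation. The function names and the sample timings are hypothetical and are not measurements of Qwen3-Omni.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Ratio of compute time to audio duration; below 1 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def keeps_up(processing_seconds: float, audio_seconds: float) -> bool:
    """True when audio is produced faster than it is played back."""
    return real_time_factor(processing_seconds, audio_seconds) < 1.0


if __name__ == "__main__":
    # Hypothetical example: 2 seconds of compute for 4 seconds of speech.
    print(real_time_factor(2.0, 4.0))
    print(keeps_up(2.0, 4.0))
```

The latency figures quoted above (0.234 and 0.547 seconds) describe the delay before the first output arrives, which is a separate measure from the RTF of sustained generation.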

Prices and accessibility

Alibaba set up token-based billing for its API, with costs that vary by type of input and output. For example, text input costs USD 0.00025 per 1,000 tokens, while text-plus-audio output costs USD 0.00876 per 1,000 tokens for the audio portion, with the text portion free. This structure seeks to encourage mass adoption by developers and companies.
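To get a feel for these rates, the following Python sketch estimates the cost of a single request. It uses only the two prices quoted above and assumes that the text portion of a text-plus-audio response is free, as stated. The function name and the token counts in the example are hypothetical, and other input types (image, audio, video) are not covered.

```python
# Rates quoted in the article, in USD per 1,000 tokens.
TEXT_INPUT_USD_PER_1K = 0.00025
AUDIO_OUTPUT_USD_PER_1K = 0.00876  # audio portion of text-plus-audio output


def estimate_cost_usd(text_input_tokens: int, audio_output_tokens: int) -> float:
    """Estimate request cost; output text is assumed free alongside audio."""
    if text_input_tokens < 0 or audio_output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = text_input_tokens / 1000 * TEXT_INPUT_USD_PER_1K
    output_cost = audio_output_tokens / 1000 * AUDIO_OUTPUT_USD_PER_1K
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical request: 2,000 text tokens in, 500 audio tokens out.
    print(f"USD {estimate_cost_usd(2000, 500):.5f}")
```

With those hypothetical counts, the input costs USD 0.0005 and the audio output USD 0.00438, for a total of about USD 0.00488.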

Because it is released under the Apache 2.0 license, Qwen3-Omni allows commercial use, modification and redistribution without requiring derivative works to be open-sourced, which reduces legal risk and eases integration into proprietary systems. This openness could drive new solutions for transcription, translation, OCR, music tagging and video analysis.

Business impact and the future of Qwen

For companies, Qwen3-Omni represents an opportunity to incorporate advanced multimodal capabilities without license costs or contractual restrictions. They can adapt the model to specific sectors or local regulations and benefit from community contributions. This approach contrasts with the barriers of closed models, which usually require payment and limit customization.

With this launch, Alibaba reinforces its strategy to compete globally in AI, showing that innovation is not exclusive to Silicon Valley. Qwen3-Omni could mark a turning point in the adoption of open multimodal models, offering developers and companies powerful tools for new interactive and multilingual experiences.

Original image by DiarioBitcoin, created with artificial intelligence, free to use, licensed under the public domain.

This article was written by an AI content editor and reviewed by a human editor to ensure quality and accuracy.

WARNING: DiarioBitcoin offers informative and educational content on various topics, including cryptocurrencies, AI, technology and regulations. We do not provide financial advice. Crypto-asset investments are high risk and may not be suitable for everyone. Do your own research, consult an expert and check the applicable legislation before investing. You could lose all your capital.
