Qwen2.5-Omni-7B: The Ultimate End-to-End Multimodal AI Model

Introduction

Qwen2.5-Omni-7B is an end-to-end multimodal model from the Qwen team at Alibaba Cloud. Released as part of the Qwen2.5 series, this 7B-parameter model perceives text, images, audio, and video, and generates both text and natural speech responses in a streaming manner.

What sets Qwen2.5-Omni-7B apart is that a single model handles speech, vision, and text together, which is what makes it an "omni" model. That combination places it among the most capable open-source multimodal models currently available.

Key Features and Capabilities

Novel Thinker-Talker Architecture

At the heart of Qwen2.5-Omni-7B lies its innovative Thinker-Talker architecture, specifically designed for comprehensive multimodal perception. This architecture enables the model to:

  • Process multiple input modalities simultaneously
  • Generate both text and speech outputs
  • Provide streaming responses in real-time

The architecture includes a novel position embedding system called TMRoPE (Time-aligned Multimodal RoPE), which synchronizes timestamps of video inputs with audio, enabling more coherent multimodal understanding.
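To make the idea of time alignment concrete, here is a toy sketch. It is not the model's actual TMRoPE implementation: it only assumes that every audio or video frame carries a timestamp in milliseconds and that one temporal position ID covers a fixed time window. The 40 ms window and the function name `temporal_position_ids` are assumptions chosen for illustration.

```python
def temporal_position_ids(timestamps_ms, ms_per_id=40):
    """Map frame timestamps (in ms) to temporal position IDs.

    ms_per_id is an assumed granularity, used here only for illustration.
    """
    return [t // ms_per_id for t in timestamps_ms]


# Hypothetical inputs: video sampled twice per second, audio frames every 40 ms.
video_ts = [0, 500, 1000, 1500]
audio_ts = list(range(0, 1600, 40))

video_ids = temporal_position_ids(video_ts)
audio_ids = temporal_position_ids(audio_ts)

# A video frame and an audio frame captured at the same moment
# receive the same temporal ID, so the two streams line up in time.
shared = [i for i in video_ids if i in audio_ids]
print(video_ids)
print(shared)
```

Because both streams are indexed by wall-clock time rather than by their own frame counters, a video frame at one second and the audio heard at one second end up at the same temporal position.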

Real-Time Voice and Video Chat

The model is built for fully real-time interactions, supporting chunked input processing and immediate output generation. This capability is crucial for applications requiring natural conversational flow, such as virtual assistants and interactive systems.
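As a rough illustration of what chunked input means in practice, the sketch below splits a stream of audio samples into fixed-size chunks that could be handed to a model one at a time. The helper name `iter_chunks` and the chunk size are hypothetical; the model's own chunking logic is internal and is not shown in this article.

```python
from typing import Iterator, List, Sequence


def iter_chunks(samples: Sequence[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield consecutive chunks of at most chunk_size samples."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(samples), chunk_size):
        yield list(samples[start:start + chunk_size])


# Ten dummy samples split into chunks of four; the last chunk is shorter.
stream = [float(i) for i in range(10)]
chunks = list(iter_chunks(stream, chunk_size=4))
print([len(c) for c in chunks])
```

Processing each chunk as it arrives, instead of waiting for the full recording, is what allows a response to begin before the input has finished.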

Natural and Robust Speech Generation

In speech generation, Qwen2.5-Omni-7B is reported to outperform many existing streaming and non-streaming alternatives in both robustness and naturalness, making it suitable for applications where high-quality voice output is essential.

Strong Cross-Modal Performance

When benchmarked against similarly sized single-modality models, Qwen2.5-Omni-7B exhibits exceptional performance across all modalities. It outperforms the similarly sized Qwen2-Audio in audio capabilities and achieves comparable performance to Qwen2.5-VL-7B in vision-language tasks, demonstrating its versatility as a true multimodal system.

Excellent Speech Instruction Following

One of the most impressive aspects of Qwen2.5-Omni-7B is its ability to follow instructions through speech input with performance rivaling its text input capabilities. This is evidenced by its strong performance on benchmarks such as MMLU and GSM8K when provided with speech input, showing that the model maintains high cognitive capabilities regardless of input modality.

Benchmark Performance

Qwen2.5-Omni-7B has undergone comprehensive evaluation across multiple benchmarks, consistently demonstrating strong performance in various domains:

Multimodal Benchmarks

In OmniBench, which tests performance across speech, sound events, and music understanding:

  • Qwen2.5-Omni-7B: 56.13% average performance
  • Gemini-1.5-Pro: 42.91%
  • Baichuan-Omni-1.5: 42.90%
  • MiniCPM-o: 40.50%

Among the models listed, Qwen2.5-Omni-7B scores highest on OmniBench, leading the next-best result by more than 13 percentage points.

Audio Processing

For speech recognition on LibriSpeech:

  • Qwen2.5-Omni-7B: 1.8 WER on test-clean, 3.4 WER on test-other
  • Qwen2-Audio: 1.6 WER on test-clean, 3.6 WER on test-other
  • Whisper-large-v3: 1.8 WER on test-clean, 3.6 WER on test-other
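WER stands for word error rate: the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. Lower is better, and the figures above are percentages. The sketch below is a minimal reference computation using word-level edit distance; the function name `word_error_rate` is illustrative, and real evaluations usually normalize text (casing, punctuation) first, which is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{100 * wer:.1f}% WER")
```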

For audio understanding on MMAU:

  • Qwen2.5-Omni-7B: 65.60% (average)
  • Gemini-Pro-V1.5: 54.90%
  • Qwen2-Audio: 49.20%

Image and Video Understanding

On image understanding benchmarks:

  • MMMU val: 59.2% (compared to 60.0% for GPT-4o-mini and 58.6% for Qwen2.5-VL-7B)
  • MMBench-V1.1-EN test: 81.8% (compared to 82.6% for Qwen2.5-VL-7B and 76.0% for GPT-4o-mini)

For video understanding:

  • MVBench: 70.3% (compared to 69.6% for Qwen2.5-VL-7B)
  • Video-MME without subtitles: 64.3% (compared to 65.1% for Qwen2.5-VL-7B)

Text-Only Benchmarks

Despite being a multimodal model, Qwen2.5-Omni-7B maintains strong performance on text-only benchmarks:

  • MMLU-redux: 71.0% (compared to 75.4% for Qwen2.5-7B)
  • GSM8K: 88.7% (compared to 91.6% for Qwen2.5-7B)
  • HumanEval: 78.7% (compared to 84.8% for Qwen2.5-7B)

While its text-only performance trails its specialized text counterpart (Qwen2.5-7B) by roughly 3 to 6 points on these benchmarks, it outperforms many comparable models like Llama3.1-8B and Gemma2-9B across most benchmarks.
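To put the gap to Qwen2.5-7B in numbers, the short calculation below takes the three text-only scores listed above and reports both the difference in points and the share of the text-only model's score that the multimodal model retains. The helper name `gap_report` is illustrative.

```python
# (Qwen2.5-Omni-7B score, Qwen2.5-7B score), taken from the list above.
TEXT_SCORES = {
    "MMLU-redux": (71.0, 75.4),
    "GSM8K": (88.7, 91.6),
    "HumanEval": (78.7, 84.8),
}


def gap_report(scores):
    """Return the point gap and retained percentage for each benchmark."""
    report = {}
    for name, (omni, text_only) in scores.items():
        report[name] = {
            "gap_points": round(text_only - omni, 1),
            "retained_pct": round(100 * omni / text_only, 1),
        }
    return report


report = gap_report(TEXT_SCORES)
for name, row in report.items():
    print(f"{name}: -{row['gap_points']} points, {row['retained_pct']}% retained")
```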

Running Qwen2.5-Omni-7B Locally

Setting up and running Qwen2.5-Omni-7B locally requires some preparation due to its multimodal requirements. Here's a comprehensive guide to get started:

System Requirements

To run Qwen2.5-Omni-7B effectively, you'll need:

  • CUDA-compatible GPU with sufficient memory (theoretical minimum, BF16):
      • 15s video: 31.11 GB
      • 30s video: 41.85 GB
      • 60s video: 60.19 GB
      • Note: Actual memory usage is typically around 1.2 times these theoretical minimums
  • Software requirements:
      • Python 3.8+
      • PyTorch 2.0+
      • FFmpeg (for audio/video processing)
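A quick way to sanity-check a GPU against these figures is to scale the theoretical minimum by the overhead factor and compare the result with the card's memory. The sketch below does that for the three BF16 figures listed above. It reads the 1.2x note as a multiplier on the theoretical minimum, which is an interpretation, and the function names are illustrative.

```python
# Theoretical minimum memory in GB (BF16), keyed by video length in seconds.
THEORETICAL_MIN_GB = {15: 31.11, 30: 41.85, 60: 60.19}

# Assumed reading of the note above: practical usage is about 1.2x the minimum.
OVERHEAD_FACTOR = 1.2


def practical_memory_gb(video_seconds: int) -> float:
    """Estimated practical memory need for a listed video length."""
    return THEORETICAL_MIN_GB[video_seconds] * OVERHEAD_FACTOR


def fits_on_gpu(video_seconds: int, gpu_memory_gb: float) -> bool:
    """True if the estimate fits within the given GPU memory."""
    return practical_memory_gb(video_seconds) <= gpu_memory_gb


for seconds in sorted(THEORETICAL_MIN_GB):
    print(f"{seconds}s video: about {practical_memory_gb(seconds):.1f} GB in practice")
```

By this estimate, a 30-second video would not fit on a 48 GB card even though the theoretical minimum suggests it would.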

Installation Steps

Install necessary packages:

pip uninstall transformers
pip install git+https://github.com/huggingface/transformers@3a1ead0aabed473eafe527915eea8c197d424356
pip install accelerate
pip install "qwen-omni-utils[decord]"

Install Flash Attention 2 (optional but recommended for performance):

pip install -U flash-attn --no-build-isolation
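After installing, a small preflight script can confirm that the pieces are in place before you download the model weights. This sketch uses only the Python standard library; the function name `preflight` and the default package list are assumptions, and it checks only whether each package can be found, not its version (so it does not verify PyTorch 2.0+).

```python
import importlib.util
import shutil
import sys

# Import names assumed to correspond to the packages installed above.
DEFAULT_PACKAGES = ("torch", "transformers", "accelerate", "qwen_omni_utils", "soundfile")


def preflight(packages=DEFAULT_PACKAGES):
    """Return a dict mapping each requirement to True (found) or False."""
    report = {
        "python>=3.8": sys.version_info >= (3, 8),
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }
    for name in packages:
        # find_spec locates a top-level package without importing it.
        report[name] = importlib.util.find_spec(name) is not None
    return report


for requirement, ok in preflight().items():
    print(f"{'OK     ' if ok else 'MISSING'} {requirement}")
```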

Basic Usage Example

Here's a basic example of how to use Qwen2.5-Omni-7B with Transformers:

import soundfile as sf
from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# Load the model
model = Qwen2_5OmniModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
    # Uncomment for better performance with compatible hardware
    # attn_implementation="flash_attention_2",
)

# Load the processor
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# Prepare conversation
conversation = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"},
        ],
    },
]

# Preparation for inference
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text,
    audios=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True
)
inputs = inputs.to(model.device).to(model.dtype)

# Inference: Generation of the output text and audio
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)

# Save audio output
sf.write(
    "output.wav",
    audio.reshape(-1).detach().cpu().numpy(),
    samplerate=24000,
)
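When you send different kinds of input, it can help to build the user message with a small helper instead of writing the nested dictionaries by hand. The sketch below is a hypothetical convenience function, not part of any library. The `video` entry shape comes from the example above; the `image`, `audio`, and `text` shapes are assumed to follow the same pattern and should be checked against the qwen-omni-utils documentation.

```python
def build_user_message(*, text=None, image=None, audio=None, video=None):
    """Build a user message dict in the conversation format used above.

    Only the video entry shape is confirmed by this article; the other
    entry shapes are assumptions that mirror it.
    """
    content = []
    for kind, value in (("image", image), ("audio", audio), ("video", video), ("text", text)):
        if value is not None:
            content.append({"type": kind, kind: value})
    if not content:
        raise ValueError("provide at least one of text, image, audio, or video")
    return {"role": "user", "content": content}


message = build_user_message(
    video="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"
)
print(message)
```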

Usage Tips

Audio Output Requirements

To enable audio output, the system prompt must be set exactly as shown:

{
    "role": "system",
    "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."
}
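Because the match has to be exact, it is easy to lose audio output to a stray edit in the prompt. One defensive option is to keep the prompt in a single constant and check the conversation before generating. The sketch below is a hypothetical helper, not part of the model's API.

```python
# The system prompt required for audio output, copied from above.
AUDIO_SYSTEM_PROMPT = (
    "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
    "capable of perceiving auditory and visual inputs, as well as generating "
    "text and speech."
)


def has_audio_system_prompt(conversation) -> bool:
    """True if the conversation contains the exact audio system prompt."""
    return any(
        message.get("role") == "system" and message.get("content") == AUDIO_SYSTEM_PROMPT
        for message in conversation
    )


conversation = [
    {"role": "system", "content": AUDIO_SYSTEM_PROMPT},
    {"role": "user", "content": [{"type": "video", "video": "draw.mp4"}]},
]
print(has_audio_system_prompt(conversation))
```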

Voice Type Selection

Qwen2.5-Omni-7B supports two voice types:

  • Chelsie (Female): A honeyed, velvety voice with gentle warmth and luminous clarity
  • Ethan (Male): A bright, upbeat voice with infectious energy and warmth

You can specify the voice using the spk parameter:

text_ids, audio = model.generate(**inputs, spk="Ethan")
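If the voice name comes from user input or a config file, validating it up front gives a clearer error than a failure deep inside generation. The helper below is a hypothetical sketch; it knows only the two voices listed above.

```python
SUPPORTED_VOICES = ("Chelsie", "Ethan")


def resolve_voice(name: str) -> str:
    """Return the canonical voice name, ignoring case and surrounding spaces."""
    wanted = name.strip().lower()
    for voice in SUPPORTED_VOICES:
        if voice.lower() == wanted:
            return voice
    raise ValueError(f"unknown voice {name!r}; expected one of {SUPPORTED_VOICES}")


voice = resolve_voice(" ethan ")
print(voice)
# The result can then be passed on, for example:
# text_ids, audio = model.generate(**inputs, spk=voice)
```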

Video Processing Options

Video URL compatibility depends on the third-party library version:

  • torchvision >= 0.19.0: Supports both HTTP and HTTPS
  • decord: Supports HTTP only

You can change the backend by setting environment variables:

FORCE_QWENVL_VIDEO_READER=torchvision
# or
FORCE_QWENVL_VIDEO_READER=decord
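The same variable can be set from Python. The sketch below picks a backend from the URL scheme, following the compatibility notes above: HTTPS needs torchvision, while anything else keeps decord. Treating decord as the default and the function names are assumptions, and the variable is assumed to take effect only if it is set before any video is processed.

```python
import os
from urllib.parse import urlparse


def pick_video_reader(url: str) -> str:
    """Choose a reader backend that can open the given video URL."""
    scheme = urlparse(url).scheme.lower()
    if scheme == "https":
        # Per the notes above, decord handles HTTP only.
        return "torchvision"
    return "decord"


def configure_video_reader(url: str) -> str:
    """Set FORCE_QWENVL_VIDEO_READER for the given URL and return the choice."""
    reader = pick_video_reader(url)
    os.environ["FORCE_QWENVL_VIDEO_READER"] = reader
    return reader


chosen = configure_video_reader(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"
)
print(chosen)
```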

Docker Deployment

For simplified deployment, you can use the official Docker image:

docker run --gpus all --ipc=host --network=host --rm --name qwen2.5-omni -it qwenllm/qwen-omni:2.5-cu121 bash

To launch the web demo through Docker:

bash docker/docker_web_demo.sh --checkpoint /path/to/Qwen2.5-Omni-7B --flash-attn2

vLLM Deployment

For faster inference, vLLM is recommended:

Install vLLM with Qwen2.5-Omni support:

pip install git+https://github.com/huggingface/transformers@1d04f0d44251be5e236484f8c8a00e1c7aa69022
pip install accelerate
pip install qwen-omni-utils
git clone -b qwen2_omni_public_v1 https://github.com/fyabc/vllm.git vllm
cd vllm
pip install .

Basic vLLM usage (text-only output currently supported):

import os
import torch
from transformers import Qwen2_5OmniProcessor
from vllm import LLM, SamplingParams
from qwen_omni_utils import process_mm_info

os.environ['VLLM_USE_V1'] = '0'  # vLLM engine v1 not supported yet
MODEL_PATH = "Qwen/Qwen2.5-Omni-7B"

llm = LLM(
    model=MODEL_PATH,
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
    tensor_parallel_size=torch.cuda.device_count(),
    limit_mm_per_prompt={'image': 1, 'video': 1, 'audio': 1},
    seed=1234
)

# Process inputs and generate outputs as shown in the example

Conclusion

Qwen2.5-Omni-7B is a significant advance in multimodal AI, handling text, image, audio, and video processing in a single model. At 7B parameters, it balances capability against resource requirements, although video inputs still call for a GPU with tens of gigabytes of memory.

The model both understands multiple modalities and generates text and speech, which opens up applications in virtual assistants, content creation, accessibility tools, and more. Its competitive performance against similarly sized single-modality models such as Qwen2-Audio and Qwen2.5-VL-7B demonstrates the effectiveness of its architecture and training approach.

As AI continues to evolve toward more human-like interaction capabilities, models like Qwen2.5-Omni-7B represent an important step forward in creating more natural and versatile artificial intelligence systems that can seamlessly bridge multiple forms of communication.