Qwen-AgentWorld Deployment Guide: Run the Open 35B Language World Model Locally

A practical English guide to Qwen-AgentWorld, Alibaba Qwen's language world model for AI agents. Learn its seven-domain design, training pipeline, AgentWorldBench results, SGLang and vLLM deployment commands, local Transformers inference, API calls, evaluation workflow, and fine-tuning options.

发布于 2026年7月3日generalGEO 评分: 55
Qwen-AgentWorldQwen AgentWorld deploymentlanguage world modelAI agent simulatorAgentWorldBenchQwen 35B A3BSGLang Qwen AgentWorldvLLM Qwen AgentWorldTransformers local inferenceQwen language world modelagent environment simulationMCP agentSWE agent benchmarkAI agent training
A clean 16:9 tech blog cover with a dark AI infrastructure background, a central Qwen-AgentWorld node, seven connected agent domains, and a small deployment terminal panel. Minimal, modern, and focused on AI agent simulation.

Qwen-AgentWorld is a language world model released by the Qwen team for simulating agent environments. Instead of only answering questions like a general chat model, it is designed to predict what an environment would return after an agent takes an action.

This makes it especially relevant for AI agent research, simulated reinforcement learning, benchmark evaluation, and local experiments around terminal, software engineering, search, MCP, web, operating system, and Android-style environments.

This article is a lightly rewritten and translated version of the original Chinese article. The structure, technical flow, commands, tables, and key ideas are preserved, while the language has been adjusted for smoother English reading and SEO publishing.

Source note: The original article was published on CSDN and states that it follows the CC BY-SA 4.0 license. Original source: Qwen-AgentWorld完整部署指南:免费开源,性能超GPT-5.4,5分钟跑起来. Verification note: Official Qwen pages confirm the public release of Qwen-AgentWorld-35B-A3B model weights and AgentWorldBench. The larger Qwen-AgentWorld-397B-A17B is included in official benchmark results, but the public model page and GitHub release primarily point to the 35B-A3B model weights.


1. Background: Why Do We Need a Language World Model?

Over the past two years, AI agents have moved quickly from simple chat assistants into tools that can operate websites, run terminal commands, control mobile apps, and complete software engineering tasks.

But training a strong agent is expensive. It often requires large volumes of real environment interaction, and that creates several practical problems:

  • Building and maintaining environments is tedious.

  • Data collection is slow and hard to scale.

  • Real environments carry risk, especially when testing failure cases or injecting controlled disruptions.

A Language World Model, or LWM, is built to solve this problem. The idea is simple but powerful: let a model play the role of the environment. Given an agent action and the interaction history, the model predicts the next environment state.

With that setup, agents can be trained and evaluated in simulation instead of always relying on real systems.

On 2026-06-24, the Qwen team released Qwen-AgentWorld, a native language world model that unifies seven agent interaction domains in one model. The companion benchmark, AgentWorldBench, was also released.

Official resources:

GitHub: QwenLM/Qwen-AgentWorld


2. Core Idea: What Makes It a “Native” World Model?

The word native is important here. Qwen-AgentWorld is not just a general-purpose LLM adapted after training to imitate an environment. Its world-modeling goal is built into the training process from the beginning.

Comparison Dimension

Traditional Approach

Qwen-AgentWorld

Training starting point

Fine-tune a general LLM

Treat environment modeling as the goal from CPT onward

Training process

Usually SFT or RL only

CPT → SFT → RL

Environment knowledge

Added through extra data or adaptation

Internalized during training

Domain coverage

One or a few domains

Seven domains in one model

In other words, Qwen-AgentWorld is not just a general model wrapped with prompts. It is trained from the lower layers of the pipeline to predict the next state of an environment.

That gives the model a more structured understanding of environment dynamics, especially when simulating long interaction trajectories.


3. Seven Domains: Text and GUI Environments in One Model

Qwen-AgentWorld splits agent interaction scenarios into two large groups: text-based environments and GUI-based environments.

┌──────────────────────────────────────────┐
│             Qwen-AgentWorld              │
│                                          │
│  Text Environments    GUI Environments   │
│  ┌──────────┐       ┌──────────────────┐ │
│  │  MCP     │       │  Web             │ │
│  │  Search  │       │  OS              │ │
│  │  Terminal│       │  Android         │ │
│  │  SWE     │       └──────────────────┘ │
│  └──────────┘                            │
└──────────────────────────────────────────┘

Domain

Type

Description

MCP

Text

Tool calling and Model Context Protocol interactions

Search

Text

Search engine interaction and retrieval behavior

Terminal

Text

Linux terminal command execution

SWE

Text

Software engineering tasks, such as code fixes

Web

GUI

Browser and webpage interaction

OS

GUI

Desktop operating system interaction

Android

GUI

Mobile app and Android-style UI interaction

For the three GUI domains, observations are represented as renderable code rather than raw pixel frames. This lets a text-based world model cover visual environments without directly processing full image sequences.

The model was trained on more than 10 million real-world interaction trajectories across the seven domains.


4. Three-Stage Training Pipeline

Qwen-AgentWorld uses a connected three-stage training pipeline: CPT → SFT → RL.

Stage 1: CPT — Injecting Environment Knowledge

During continual pre-training, the model learns from large-scale real environment interaction trajectories. This stage embeds environment dynamics into the model weights.

The original article also mentions a turn-level information-theoretic loss mask. The goal is to identify which dialogue turns actually carry environment-state information and reduce noise from less useful turns.

Stage 2: SFT — Activating Chain-of-Thought Reasoning

Supervised fine-tuning turns next-state prediction into a chain-of-thought style reasoning pattern.

Instead of directly outputting a predicted result, the model learns to reason through why a state should change before generating the next observation.

Stage 3: RL — Refining Simulation Fidelity

The reinforcement learning stage uses hybrid reward signals, including the GSPO algorithm, to improve output quality.

The optimization focuses on:

  • Format correctness

  • Factual accuracy

Context consistency

Realism

Overall simulation quality

Emergent behaviors mentioned in the original article: Qwen-AgentWorld reportedly shows self-correction behavior, information-leakage prevention in search scenarios, and multi-step causal reasoning for some command-output predictions.


5. Open-Source Model List

Release

Parameters

Activated Parameters

Context Length

Positioning

Qwen-AgentWorld-35B-A3B

35B

3B

256K tokens

Public, efficient open model

Qwen-AgentWorld-397B-A17B

397B

17B

Not clearly listed in the original table

Flagship benchmark model

AgentWorldBench

Evaluation benchmark

35B-A3B Architecture Details

  • Base model: Qwen3.5-35B-A3B-Base

  • Model type: Causal Language Model / Language World Model

  • Architecture style: Hybrid linear attention + MoE

  • Hidden dimension: 2048

  • Layers: 40 layers

  • Layer layout: repeated groups with Gated DeltaNet, Gated Attention, and MoE components

  • Experts: 256 experts

Activated experts: 8 routed experts + 1 shared expert

Context length: 262,144 tokens

  • Recommended minimum context: 128K tokens for better long-trajectory simulation quality

Official Hugging Face documentation also notes that the model is compatible with Transformers, vLLM, and SGLang.


6. Performance Comparison: AgentWorldBench Results

AgentWorldBench scores each model across five dimensions: Format, Factuality, Consistency, Realism, and Quality. Scores are normalized to a 0–100 scale, where higher is better.

Full Ranking by Overall Score

Model

MCP

Search

Terminal

SWE

Android

Web

OS

Overall

Qwen-AgentWorld-397B-A17B

68.24

37.82

57.73

68.49

60.20

50.98

67.89

58.71

GPT-5.4

70.10

37.26

53.69

66.29

60.00

51.80

68.58

58.25

Claude Opus 4.6

69.90

29.30

57.51

64.55

61.74

51.42

70.20

57.80

Claude Opus 4.8

54.93

35.14

59.18

64.10

61.50

54.66

66.62

56.59

Qwen-AgentWorld-35B-A3B

64.79

36.69

53.96

65.63

58.17

49.55

65.92

56.39

Claude Sonnet 4.6

70.00

28.79

56.98

64.52

58.03

50.78

63.17

56.04

Qwen3.5-397B-A17B

68.31

30.81

55.30

64.44

54.90

48.55

60.85

54.74

Gemini 3.1 Pro

59.07

30.21

52.47

59.07

61.40

52.83

66.92

54.57

DeepSeek-V4-Pro

63.27

27.61

51.26

59.44

55.17

50.32

63.70

52.97

Qwen3.5-35B-A3B

57.87

25.98

46.13

47.58

53.18

47.10

56.27

47.73

Key takeaways from the original article:

  • Qwen-AgentWorld-397B-A17B reaches an overall score of 58.71 and ranks first in the listed AgentWorldBench table.

  • Qwen-AgentWorld-35B-A3B improves by +8.66 points over the base Qwen3.5-35B-A3B model.

Practical note: Treat benchmark numbers as reference data from the official benchmark setup. Real results will depend on hardware, prompt design, serving framework, context length, and the environment being simulated.


7. Four Application Patterns and Experimental Results

Pattern 1: Generalizable OOD Environment Expansion

The original article describes using Qwen-AgentWorld-397B-A17B for simulated RL across 4,000 out-of-distribution OpenClaw environments, then testing zero-shot generalization in new domains.

Training Method

Claw-Eval

QwenClawBench

Base SFT

65.4

47.9

Sim RL with a general model simulator

66.7

47.8

Sim RL with Qwen-AgentWorld simulator

69.7

55.0

Improvement

+4.3

+7.1

Pattern 2: Controllable Simulation — MCP Targeted Perturbation

Controlled perturbations can expose weak points in an agent more effectively than standard real-environment training.

Configuration

Tool Decathlon

MCPMark

Base SFT

32.4

21.5

Sim RL without control

31.5

24.6

Sim RL with control

36.1

33.8

Improvement

+3.7

+12.3

Pattern 3: Fictional World Construction — Search Domain

The Search-domain experiment uses a fictional but self-consistent search world for training, then evaluates generalization on real search tasks.

Configuration

WideSearch F1 Item

WideSearch F1 Row

Base SFT, 35B

34.02

13.72

+ Sim RL fictional world

50.31

24.21

Improvement

+16.29

+10.49

Pattern 4: Agent Foundation Model — LWM RL Warm-Up Transfer

The article also describes LWM RL warm-up as a way to improve downstream agent performance without extra RL fine-tuning on those specific tasks.

Metric

Terminal-Bench 2.0

SWE-Bench Verified

SWE-Bench Pro

WideSearch F1

Claw-Eval

BFCL v4

Base SFT

33.25

64.47

42.18

33.38

53.60

62.29

+ LWM RL warm-up

39.55

67.86

47.42

46.17

64.88

71.25

Improvement

+6.30

+3.39

+5.24

+12.79

+11.28

+8.96

Highlight: The warm-up data comes from single-turn, non-agentic trajectories, yet the improvement transfers to more complex multi-turn tool-calling agent tasks. That suggests world-modeling knowledge can transfer beyond its original training format.


8. Quick Deployment Guide

Method 1: Deploy with SGLang

SGLang is recommended in the original article for fast serving.

pip install sglang
python -m sglang.launch_server \
    --model-path Qwen/Qwen-AgentWorld-35B-A3B \
    --port 8000 \
    --tp-size 4 \
    --context-length 262144 \
    --reasoning-parser qwen3

After startup, the OpenAI-compatible API endpoint is:

http://localhost:8000/v1

Method 2: Deploy with vLLM

pip install vllm
vllm serve Qwen/Qwen-AgentWorld-35B-A3B \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 262144 \
    --reasoning-parser qwen3 \
    --trust-remote-code

Official-docs note: The current Hugging Face model card also recommends using --language-model-only with vLLM because the model architecture includes visual component definitions while the checkpoint contains language model weights. If vLLM initialization fails, try adding that flag.

vllm serve Qwen/Qwen-AgentWorld-35B-A3B \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 262144 \
    --reasoning-parser qwen3 \
    --language-model-only \
    --trust-remote-code

Method 3: Local Inference with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen-AgentWorld-35B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Terminal-domain example: ask the model to predict command output.
messages = [
    {
        "role": "system",
        "content": "You are a language world model simulating a Linux terminal environment. "
                   "Given the user's command, predict the terminal output."
    },
    {
        "role": "user",
        "content": "Action: execute_bash\nCommand: ls -la /home/user/project/"
    }
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048, temperature=0.6)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

Method 4: Call Through an OpenAI-Compatible API

This method works after serving the model through SGLang or vLLM.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

messages = [
    {
        "role": "system",
        "content": "You are a language world model simulating a Linux terminal environment."
    },
    {
        "role": "user",
        "content": "Action: execute_bash\nCommand: pwd"
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen-AgentWorld-35B-A3B",
    messages=messages,
    max_tokens=32768,
    temperature=0.6,
)

print(response.choices[0].message.content)

Best Practices

  • Recommended sampling: temperature=0.6, top_p=0.95, top_k=20

Recommended output length: around 32,768 tokens for most long observations

  • Use the domain-specific system prompts from the repository prompts/ directory for better simulation quality

  • Keep context length at least 128K when possible; the default model context is 256K


9. AgentWorldBench Evaluation Workflow

If you want to test your own world model on AgentWorldBench, the original article gives a three-step workflow.

# 1. Clone the evaluation repository
git clone https://github.com/QwenLM/Qwen-AgentWorld.git
cd Qwen-AgentWorld

# 2. Download the evaluation dataset
huggingface-cli download Qwen/AgentWorldBench --repo-type dataset --local-dir ./AgentWorldBench

# 3. Install dependencies
pip install openai

cd eval

# Step 1: world model inference
python eval.py infer \
    --data-dir ../AgentWorldBench \
    --model-base-url http://localhost:8000/v1 \
    --model-name Qwen/Qwen-AgentWorld-35B-A3B \
    --output-dir ./results

# Step 2: LLM judge scoring. This requires an OpenAI API key.
export OPENAI_API_KEY="your-api-key"
python eval.py judge \
    --predictions ./results/predictions.jsonl \
    --judge-base-url https://api.openai.com/v1 \
    --judge-model gpt-5.2-2025-12-11 \
    --output-dir ./results

# Step 3: aggregate scores
python eval.py score --predictions ./results/judged.jsonl

Each test sample includes ground-truth observation data from real environment execution. The benchmark evaluates world-modeling ability across format, factuality, consistency, realism, and quality.


10. Fine-Tuning Suggestions

If you want to customize Qwen-AgentWorld for a specific domain, the original article recommends three common fine-tuning frameworks.

Framework

Strength

Suitable Scenario

ms-swift

High integration with ModelScope

Fast experiments and Alibaba ecosystem workflows

LLaMA-Factory

Active community and broad training strategy support

Practical engineering deployment

Unsloth

Strong memory optimization

Resource-constrained fine-tuning


11. Source Notes and Image Handling

The original article includes several images related to Qwen-AgentWorld domains and benchmark results. These were kept in the relevant sections.

CSDN platform icons, promotion modules, author subscription blocks, QR codes, reward buttons, and unrelated recommendation images were removed according to the publishing requirements.


FAQ

What is Qwen-AgentWorld?

Qwen-AgentWorld is a language world model from the Qwen team. It predicts the next environment state after an agent takes an action, making it useful for agent simulation, training, and evaluation.

Is Qwen-AgentWorld the same as a normal chat model?

No. A normal chat model is mainly optimized for conversation and instruction following. Qwen-AgentWorld is trained as an environment simulator, so its main use case is predicting observations in agent interaction environments.

Which Qwen-AgentWorld model is publicly available?

Official pages list Qwen-AgentWorld-35B-A3B as the publicly released model weight. AgentWorldBench is also available as an evaluation benchmark. The larger 397B model appears in benchmark tables, but the public model release mainly points to the 35B-A3B version.

Can Qwen-AgentWorld be deployed with vLLM?

Yes. The Hugging Face model card includes a vLLM serving example. If you run into initialization issues, the official model card recommends adding --language-model-only because the checkpoint contains language model weights.

Can Qwen-AgentWorld be deployed with SGLang?

Yes. SGLang is one of the recommended serving options and can expose an OpenAI-compatible API endpoint. The model can then be called through local API requests.

Why does Qwen-AgentWorld need a long context window?

Agent environment simulation often depends on long interaction histories. A shorter context window may lose important state information, so the official guidance recommends keeping at least 128K tokens when possible.

What is AgentWorldBench used for?

AgentWorldBench is the benchmark released with Qwen-AgentWorld. It evaluates language world models across seven domains using dimensions such as format, factuality, consistency, realism, and quality.

Is Qwen-AgentWorld suitable for production use?

It can be useful for research, evaluation, simulation, and internal experiments. For production systems, you still need to evaluate latency, hardware cost, safety, prompt reliability, and whether simulated results match your real environment closely enough.


Related Tools

  • Qwen-AgentWorld GitHub: Official repository for Qwen-AgentWorld code, prompts, and evaluation workflow.

  • Qwen-AgentWorld-35B-A3B on Hugging Face: Official model page for the public 35B-A3B weights.

  • AgentWorldBench: Official benchmark dataset for evaluating language world models.

  • SGLang: A fast serving framework for large language models.

  • vLLM: A high-throughput inference engine for serving LLMs.

  • Transformers: Hugging Face library for local model loading and inference.

  • OpenAI Python SDK: Python client that can call OpenAI-compatible local model servers.

  • ms-swift: ModelScope’s training and fine-tuning framework for LLM workflows.


Related Links