GLM 4.6V

GLM-4.6V is a large multimodal model designed for high-fidelity visual understanding and long-context reasoning across images, documents, and mixed media. It supports up to 128K tokens, processes complex page layouts and charts directly as visual inputs, and integrates native multimodal function calling to connect perception with downstream tool execution. The model also enables interleaved image-text generation and UI reconstruction workflows, including screenshot-to-HTML synthesis and iterative visual editing.

12/8/2025

131,072 tokens

Specifications

Modalities

Input

image

text

video

Output

text

Supported Parameters

frequency_penalty

include_reasoning

max_tokens

min_p

presence_penalty

reasoning

repetition_penalty

response_format

seed

stop

structured_outputs

temperature

tool_choice

tools

top_k

top_p

Max Output Tokens

131,072