Multimodal

A multimodal system works with more than one type of data, typically text and images, and sometimes audio or video. For example, you can upload a screenshot and ask the model to find the bugs it shows.

Models like Gemini are often positioned around combined image-and-text workflows, while ChatGPT and Claude offer similar capabilities depending on the version. For pure image generation, see AI image generator.

Key characteristics

Lets the model work with multiple data types, such as text, images, audio, or video, in one flow.
Enables tasks such as analyzing a screenshot, transcribing audio, and responding in text in one coherent step.
Strongly influences which tool or model is best for a given task, especially in practical assistant workflows.
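
To make the screenshot example concrete, here is a minimal Python sketch of how an image and a text question might be packaged into a single multimodal message. It does not follow any particular vendor's API: the names TextPart, ImagePart, build_message, and modalities are hypothetical, and the base64 encoding of the image is an assumption about how binary data could be carried alongside text.

```python
import base64
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextPart:
    """A text segment of a message (hypothetical type)."""
    text: str


@dataclass
class ImagePart:
    """An image segment of a message (hypothetical type)."""
    media_type: str  # e.g. "image/png"
    data_b64: str    # base64-encoded image bytes


Part = Union[TextPart, ImagePart]


def build_message(prompt: str, image_bytes: bytes,
                  media_type: str = "image/png") -> List[Part]:
    """Combine an image and a text question into one multimodal message."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [ImagePart(media_type=media_type, data_b64=encoded),
            TextPart(text=prompt)]


def modalities(parts: List[Part]) -> List[str]:
    """Return the distinct modalities in the message, in first-seen order."""
    seen: List[str] = []
    for part in parts:
        kind = "image" if isinstance(part, ImagePart) else "text"
        if kind not in seen:
            seen.append(kind)
    return seen


# Placeholder bytes standing in for a real screenshot file.
fake_screenshot = b"\x89PNG\r\n\x1a\n"
message = build_message("Find the bug shown in this screenshot.",
                        fake_screenshot)
print(modalities(message))
```

The point of the sketch is that both parts travel in one request, so the model can read the question and the image together rather than in separate steps. A real service defines its own message format, so check its documentation before adapting this.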