
Figure 9 | Results on the autonomous cyber-offense suite. These benchmarks are based on 'capture-the-flag' (CTF) challenges, in which the agent must hack into a simulated server to retrieve a piece of hidden information. Labels above bars give the number of solved challenges and the total number of challenges. A challenge is considered solved if the agent succeeds in at least one of N attempts, where N varies between 5 and 30 depending on challenge complexity. Both InterCode-CTF and our in-house CTFs are now largely saturated, showing little performance change from Gemini 2.0 to Gemini 2.5 models. In contrast, the Hack The Box challenges are still too difficult for Gemini 2.5 models, and so also give little signal on capability change.

All models in the Gemini 2.X series are built to be natively multimodal, support long-context inputs of more than 1 million tokens, and offer native tool use[1]. This allows them to comprehend vast datasets and handle complex problems drawing on different information sources, including text, audio, images, video, and even entire code repositories[1].

The Gemini 2.5 models are sparse mixture-of-experts (MoE) transformers with native multimodal support for text, vision, and audio inputs[1]. Sparse MoE models activate only a subset of their parameters for each input token by learning to dynamically route each token to a small set of experts[1].
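The report does not describe Gemini's router, but the general top-k routing mechanism behind sparse MoE layers can be sketched as follows. This is a minimal NumPy illustration under common assumptions (softmax gating, dense per-expert linear layers); all names, shapes, and the choice of top-2 routing are illustrative, not taken from the Gemini report.

```python
import numpy as np

def sparse_moe_layer(tokens, gate_w, expert_ws, top_k=2):
    """Route each token to its top-k experts and combine their outputs.

    tokens:    (n_tokens, d_model) token representations
    gate_w:    (d_model, n_experts) learned router weights
    expert_ws: list of (d_model, d_model) per-expert weight matrices
    """
    logits = tokens @ gate_w                       # (n_tokens, n_experts)
    # indices of the top-k scoring experts for each token
    topk = np.argsort(logits, axis=-1)[:, -top_k:]
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        chosen = topk[t]
        # renormalize gate scores over the chosen experts only
        scores = np.exp(logits[t, chosen])
        scores /= scores.sum()
        # only the selected experts' parameters are touched for this token,
        # which is what makes the computation sparse
        for s, e in zip(scores, chosen):
            out[t] += s * (tokens[t] @ expert_ws[e])
    return out
```

With `top_k` much smaller than the number of experts, the per-token compute stays roughly constant even as total parameter count grows, which is the main appeal of the sparse MoE design.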