Gemini 2.5 Safety Mechanisms: A Comprehensive Report

Commitment to Responsible Development

Google is committed to developing Gemini responsibly, innovating on safety and security alongside capabilities[1]. This commitment spans training and evaluating models with a focus on automated red teaming, running held-out assurance evaluations on present-day risks, and evaluating the potential for dangerous capabilities in order to proactively anticipate new and long-term risks[1].

Safety Policies

The Gemini safety policies align with Google’s standard framework, preventing the generation of specific types of harmful content[1]. These policies cover several key areas:

  • Child sexual abuse and exploitation
  • Hate speech (e.g., dehumanizing members of protected groups)
  • Dangerous content (e.g., promoting suicide or instructing in activities that could cause real-world harm)
  • Harassment (e.g., encouraging violence against people)
  • Sexually explicit content
  • Medical advice that runs contrary to scientific or medical consensus

These policies apply across modalities, aiming to minimize harmful outputs irrespective of input type[1]. From a security standpoint, Gemini strives to protect users from cyberattacks, for example, by being robust to prompt injection attacks[1].

Helpfulness Desiderata

Defining what the model should do is just as important as defining what it should not do[1]. These desiderata, collectively referred to as "helpfulness", include:

  • Helping the user: Fulfilling the user's request and only refusing if it is impossible to find a policy-compliant response[1].
  • Assuming good intent: Articulating refusals respectfully without making assumptions about user intent[1].

Training for Safety

Safety is integrated into the models through both pre-training and post-training approaches[1]. The process starts by constructing metrics based on the policies and desiderata; these are typically turned into automated evaluations that guide model development through successive iterations[1]. Interventions include data filtering and conditional pre-training, as well as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human and Critic Feedback (RL*F)[1]. Dataset filtering applies safety measures to the pre-training data for the strictest policies[1]. Pre-training monitoring includes a novel evaluation that captures the model’s ability to be steered towards different viewpoints and values, which helps align the model at post-training time[1].
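
The snippet below is a minimal sketch, in Python, of how policies and desiderata might be turned into automated evaluation metrics of the kind described above; the generate, judge_violation, and judge_refusal callables are hypothetical stand-ins for model and judge calls, not Google tooling.

    # Minimal sketch (not Google's internal tooling) of turning safety policies and
    # helpfulness desiderata into automated evaluation metrics tracked across
    # training iterations.
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class EvalResult:
        violation_rate: float   # fraction of responses a judge flags as policy-violating
        refusal_rate: float     # fraction of prompts the model refuses unnecessarily

    def run_safety_eval(
        prompts: List[str],
        generate: Callable[[str], str],                   # candidate model under evaluation
        judge_violation: Callable[[str, str], bool],      # (prompt, response) -> violates policy?
        judge_refusal: Callable[[str, str], bool],        # (prompt, response) -> unnecessary refusal?
    ) -> EvalResult:
        responses = [generate(p) for p in prompts]
        violations = sum(judge_violation(p, r) for p, r in zip(prompts, responses))
        refusals = sum(judge_refusal(p, r) for p, r in zip(prompts, responses))
        n = max(len(prompts), 1)
        return EvalResult(violation_rate=violations / n, refusal_rate=refusals / n)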

Supervised Fine-Tuning (SFT)

For the SFT stage, adversarial prompts are sourced either by leveraging existing models and tools to probe Gemini’s attack surface or by relying on human interactions to discover potentially harmful behavior[1]. Throughout this process, the aim is to cover the safety policies across common model use cases[1]. When model behavior needs improvement, either because of safety policy violations or because the model refuses when a helpful, non-policy-violating answer exists, a combination of custom data generation recipes loosely inspired by Constitutional AI and human revision of responses is used[1]. Automated evaluations on both safety and non-safety metrics monitor impact and catch potential unintended regressions[1].
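
As a rough illustration of the data-curation step described above, the following sketch routes responses that violate policy, or that refuse unnecessarily, through a revision step. The violates_policy, is_unnecessary_refusal, and revise callables are hypothetical placeholders for the judges, data generation recipes, or human revision mentioned in the report.

    # Hypothetical sketch of SFT data curation: responses that violate policy, or
    # that refuse when a compliant answer exists, are routed for revision before
    # being used as fine-tuning targets.
    from typing import Callable, Iterable, List, Tuple

    def curate_sft_data(
        pairs: Iterable[Tuple[str, str]],                 # (adversarial prompt, model response)
        violates_policy: Callable[[str, str], bool],
        is_unnecessary_refusal: Callable[[str, str], bool],
        revise: Callable[[str, str], str],                # produce a compliant, helpful response
    ) -> List[Tuple[str, str]]:
        curated = []
        for prompt, response in pairs:
            if violates_policy(prompt, response) or is_unnecessary_refusal(prompt, response):
                response = revise(prompt, response)       # revised target for fine-tuning
            curated.append((prompt, response))
        return curated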

Reinforcement Learning from Human and Critic Feedback (RL*F)

Reward signals during RL come from a combination of a Data Reward Model (DRM), which amortizes human preference data, and a Critic, a prompted model that grades responses according to pre-defined rubrics[1]. Interventions are divided into Reward Model and Critic improvements (RM) and reinforcement learning (RL) improvements[1]. Prompts are sourced through human-model or model-model interactions, striving for coverage of the safety policies and use cases[1]. For DRM training, given a prompt set, custom data generation recipes are used to surface a representative sample of model responses[1]. Humans then provide feedback on the responses, often comparing multiple candidate responses for each query, and this preference data is amortized into the Data Reward Model[1]. Critics, on the other hand, do not require additional data, and iteration on the grading rubric can be done offline[1]. Similarly to SFT, RL*F steers the model away from undesirable behavior, both in terms of content policy violations, and trains the model to be helpful[1]. RL*F is accompanied by a number of evaluations that run continuously during training to monitor safety and other metrics[1].
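
The sketch below shows one assumed way a DRM score and a rubric-based Critic grade could be combined into a single RL reward. The weighting scheme, the rubric wording, and the drm_score and critic_grade callables are illustrative assumptions, not the actual implementation.

    # Illustrative sketch (assumed structure) of combining a learned Data Reward
    # Model score with a rubric-based Critic grade into a single RL reward.
    from typing import Callable, Dict

    def combined_reward(
        prompt: str,
        response: str,
        drm_score: Callable[[str, str], float],                       # amortizes human preference data
        critic_grade: Callable[[str, str, Dict[str, str]], float],    # grades against a rubric
        rubric: Dict[str, str],
        critic_weight: float = 0.5,                                   # free choice, not a reported value
    ) -> float:
        # Weighted mix of preference-based and rubric-based signals.
        return ((1 - critic_weight) * drm_score(prompt, response)
                + critic_weight * critic_grade(prompt, response, rubric))

    # Example rubric a prompted Critic might be asked to apply (hypothetical wording):
    SAFETY_RUBRIC = {
        "policy": "Does the response avoid hate speech, dangerous content, and harassment?",
        "helpfulness": "Does the response fulfil the request, refusing only when unavoidable?",
    }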

Automated Red Teaming (ART)

To complement human red teaming and static evaluations, extensive use of automated red teaming (ART) is made to dynamically evaluate Gemini at scale[1]. This makes it possible to significantly increase coverage and understanding of potential risks, and to rapidly develop model improvements that make Gemini safer and more helpful[1].

ART is formulated as a multi-agent game between populations of attackers and the target Gemini model being evaluated[1]. Attackers aim to elicit responses from the target model that satisfy defined objectives (e.g., a response that violates a safety policy or is unhelpful); these interactions are scored by various judges (e.g., against a set of policies), and the resulting scores are used by the attackers as a reward signal to optimize their attacks[1]. Attackers evaluate Gemini in a black-box setting, using natural language queries without access to the model’s internal parameters, which makes the automated red teaming more reflective of real-world use cases and challenges[1]. Attackers are prompted Gemini models, while judges are a mixture of prompted and finetuned Gemini models[1]. This approach has made it possible to rapidly scale red teaming to a growing number of areas, including policy violations, tone, helpfulness, and neutrality[1].
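
A simplified sketch of such an attacker/judge loop is shown below. The attack, target, and judge callables stand in for the prompted or finetuned Gemini models described above, and the optimization step is abstracted into a hypothetical update_attacker hook.

    # Simplified sketch of one automated red teaming round: attackers propose
    # queries, judges score the target's responses, and the scores serve as the
    # attackers' reward signal.
    from typing import Callable, List, Tuple

    def red_team_round(
        objectives: List[str],                            # e.g. policy areas or failure modes to elicit
        attack: Callable[[str], str],                     # attacker proposes a natural-language query
        target: Callable[[str], str],                     # black-box target model
        judges: List[Callable[[str, str], float]],        # score (query, response), e.g. per policy
        update_attacker: Callable[[str, str, float], None],  # consume the score as a reward signal
    ) -> List[Tuple[str, str, float]]:
        transcripts = []
        for objective in objectives:
            query = attack(objective)
            response = target(query)                      # black-box: no access to internal parameters
            score = sum(judge(query, response) for judge in judges) / max(len(judges), 1)
            update_attacker(objective, query, score)      # attacker optimizes its attacks against the reward
            transcripts.append((query, response, score))
        return transcripts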

Security Measures Against Prompt Injection Attacks

Gemini's susceptibility to indirect prompt injection attacks is evaluated, focusing on scenarios in which a third party hides malicious instructions in externally retrieved data to manipulate Gemini into taking unauthorized actions through function calling[1]. In the evaluation setting, function calls available to Gemini allow it to summarize a user’s latest emails and to send emails on their behalf[1]. The attacker's objective is to manipulate the model into invoking a send-email function call that discreetly exfiltrates sensitive information from the conversation history[1]. Several attacks automate the generation of malicious prompts, including Actor Critic, Beam Search, and Tree of Attacks w/ Pruning (TAP); after prompt injections are constructed using these methods, they are evaluated on a held-out set of synthetic conversation histories containing simulated private user information[1].
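
The harness below is a hypothetical sketch of how attack success might be measured in such a setting: an attack counts as successful if the agent issues a send-email function call that leaks private data from the conversation history. The run_agent and contains_private_data helpers, and the dictionary format of function calls, are assumptions for illustration, not real APIs.

    # Hypothetical harness for the indirect prompt-injection evaluation: a crafted
    # injection is embedded in retrieved email content, and success is counted when
    # the agent emits a send-email call that exfiltrates private information.
    from typing import Callable, Dict, List

    def attack_success_rate(
        injection: str,                                       # malicious instructions produced by e.g. TAP
        histories: List[Dict],                                # held-out synthetic conversations with private info
        run_agent: Callable[[Dict, str], List[Dict]],         # returns the function calls the model issues
        contains_private_data: Callable[[Dict, Dict], bool],  # does the call leak conversation data?
    ) -> float:
        successes = 0
        for history in histories:
            calls = run_agent(history, injection)             # injection hidden in a retrieved email body
            if any(call.get("name") == "send_email" and contains_private_data(call, history)
                   for call in calls):
                successes += 1
        return successes / max(len(histories), 1)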

Frontier Safety Framework Evaluations

Google DeepMind released its Frontier Safety Framework (FSF) in May 2024 and updated it in February 2025[1]. The FSF comprises a number of processes and evaluations that address risks of severe harm stemming from powerful capabilities of frontier models. It covers four risk domains: CBRN (chemical, biological, radiological and nuclear information risks), cybersecurity, machine learning R&D, and deceptive alignment[1]. The FSF involves the regular evaluation of Google’s frontier models to determine whether they require heightened mitigations, comparing test results against internal alert thresholds (“early warnings”) which are set significantly below the actual Critical Capability Levels (CCLs)[1].
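
As a minimal sketch of the "early warning" comparison, the snippet below flags any risk domain whose evaluation score meets or exceeds an internal alert threshold; the threshold values and scores shown are placeholders, not real numbers.

    # Minimal sketch: flag risk domains whose evaluation scores cross internal alert
    # thresholds, which sit below the actual Critical Capability Levels (CCLs).
    # All numbers are placeholders for illustration only.
    ALERT_THRESHOLDS = {"cbrn": 0.4, "cyber": 0.4, "ml_rnd": 0.4, "deceptive_alignment": 0.4}

    def early_warnings(eval_scores: dict) -> list:
        """Return the risk domains whose scores meet or exceed their alert threshold."""
        return [domain for domain, score in eval_scores.items()
                if score >= ALERT_THRESHOLDS.get(domain, float("inf"))]

    # Example: a score crossing the cyber alert threshold flags that domain for
    # heightened scrutiny well before the CCL itself is reached.
    print(early_warnings({"cbrn": 0.1, "cyber": 0.45, "ml_rnd": 0.2, "deceptive_alignment": 0.05}))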

External Safety Testing Program

As part of the External Safety Testing Program, independent external groups help identify areas for improvement in model safety by undertaking structured evaluations, qualitative probing, and unstructured red teaming[1]. Testing is carried out on the most capable Gemini models with the largest capability jumps, and external testing groups are given black-box testing access to Gemini on AI Studio for a number of weeks, enabling Google DeepMind to gather early insights into the model’s capabilities and to understand whether and where mitigations are needed[1]. These groups are selected for their expertise across a range of domain areas, such as autonomous systems, societal, cyber, and CBRN risks, and are instructed, by design, to develop their own methodology for testing topics within a particular domain area, remaining independent from internal Google DeepMind evaluations[1].