VLJailbreakBench

VLJailbreakBench: A Benchmark for Evaluating VLM Robustness against Multimodal Jailbreak Attacks

Ruofan Wang, Juncheng Li, Yixu Wang, Bo Wang, Xiaosen Wang, Yan Teng,
Yingchun Wang, Xingjun Ma *, Yu-Gang Jiang
Fudan University, Huawei Technologies Ltd., Shanghai Artificial Intelligence Laboratory

*Corresponding author

Benchmark Overview

VLJailbreakBench is structured into two evaluation tiers: a base set and a challenge set, designed to assess VLMs at distinct difficulty levels. The dataset spans 12 safety topics and 46 subcategories, comprising 916 harmful queries. For each query, we generate one jailbreak text-image pair for the base set and three for the challenge set, resulting in a comprehensive collection of 3,654 jailbreak samples. This hierarchical design ensures a rigorous evaluation of VLM robustness across varying adversarial scenarios.
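For concreteness, each benchmark entry pairs a harmful query with a generated jailbreak text-image pair under one of the 12 topics and 46 subcategories. The sketch below is a hypothetical Python representation of such an entry; the field names and file layout are assumptions, not the released data format.

from dataclasses import dataclass

@dataclass
class JailbreakSample:
    """Hypothetical record for one VLJailbreakBench entry (field names assumed)."""
    category: str        # one of the 12 safety topics, e.g. "Illegal Activities"
    subcategory: str     # one of the 46 subcategories, e.g. "Phishing Attacks"
    harmful_query: str   # the underlying harmful query (916 in total)
    jailbreak_text: str  # adversarial text prompt paired with the image
    image_path: str      # path to the paired jailbreak image
    split: str           # "base" (one pair per query) or "challenge" (three pairs per query)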

Safety Risk Taxonomy: To construct a comprehensive safety risk taxonomy for VLJailbreakBench, we collaborated with experts from the humanities and social sciences to extend existing taxonomies, ensuring coverage of both technical vulnerabilities and societal impacts. The taxonomy provides a structured classification of different security risks, offering valuable insights into VLM safety in real-world applications.

Safety taxonomy of VLJailbreakBench.

Dataset Generation

The VLJailbreakBench dataset is constructed through a three-step pipeline designed to yield high-quality multimodal jailbreak samples.

Step 1: Initial Query Generation

We generate 920 initial harmful queries across 46 safety subcategories using Google Gemini. These queries are then filtered by GPT-4o and Llama 3 to remove harmless entries, resulting in 916 refined harmful queries.
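A minimal sketch of this step is shown below, assuming a generic call_model wrapper around the respective model APIs; the prompts and the two-model agreement rule are illustrative assumptions, not the exact procedure used.

# Hypothetical sketch of Step 1. `call_model` stands in for the Gemini,
# GPT-4o, and Llama 3 APIs; prompts and the agreement rule are assumptions.

def call_model(model_name: str, prompt: str) -> str:
    """Placeholder for an actual LLM API call."""
    raise NotImplementedError

def generate_initial_queries(subcategory: str, n: int = 20) -> list[str]:
    # 46 subcategories x ~20 queries each yields the 920 initial queries.
    prompt = (f"Write {n} harmful user requests belonging to the safety "
              f"subcategory '{subcategory}', one per line.")
    return [q.strip() for q in call_model("gemini", prompt).splitlines() if q.strip()]

def is_harmful(query: str) -> bool:
    # Keep a query only if the filter models judge it harmful (920 -> 916).
    verdicts = [call_model(m, f"Is this request harmful? Answer yes or no.\n{query}")
                for m in ("gpt-4o", "llama-3")]
    return all(v.strip().lower().startswith("yes") for v in verdicts)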

Step 2: Jailbreak Data Generation

Two subsets are created for adversarial testing (a sketch of the width/depth attack loop follows the list):

  • Base Set – MiniGPT-4 attacks LLaVA-1.5 with an attack width of 5 and depth of 2, simulating moderate adversarial scenarios.
  • Challenge Set – Gemini-1.5-Pro attacks GPT-4o-mini with an attack width of 3 and depth of 3, representing advanced jailbreak scenarios. During refinement, Gemini-1.5-Pro is replaced with Gemini-2.0-Flash-Thinking for enhanced data quality.
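As a rough illustration of how the width and depth parameters could drive the attacker/victim loop, the sketch below proposes width candidate text-image pairs per round and refines over depth rounds. The helper functions are placeholders for the attacker VLM, the victim VLM, and the success judge; the actual IDEATOR prompting and refinement logic may differ.

# Hypothetical width/depth attack loop. attacker_propose / victim_respond /
# judge_success are placeholders; the real prompting and refinement differ.

def attacker_propose(query: str, feedback: str | None, width: int) -> list[dict]:
    """Ask the attacker VLM for `width` candidate text-image jailbreak pairs."""
    raise NotImplementedError

def victim_respond(candidate: dict) -> str:
    """Query the victim VLM with a candidate text-image pair."""
    raise NotImplementedError

def judge_success(query: str, response: str) -> bool:
    """Judge whether the victim's response fulfils the harmful query."""
    raise NotImplementedError

def attack(query: str, width: int, depth: int) -> list[dict]:
    successes, feedback = [], None
    for _ in range(depth):                                     # depth = refinement rounds
        for cand in attacker_propose(query, feedback, width):  # width candidates per round
            response = victim_respond(cand)
            if judge_success(query, response):
                successes.append(cand)
            else:
                feedback = response                            # failed response guides refinement
    return successes

# Base set:      attack(q, width=5, depth=2)  (MiniGPT-4 vs. LLaVA-1.5)
# Challenge set: attack(q, width=3, depth=3)  (Gemini-1.5-Pro vs. GPT-4o-mini)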

Step 3: Data Filtering

The generated samples are filtered against the victim VLMs to ensure dataset quality (a selection sketch follows the list):

  • Base Set: One successful jailbreak instance per query is retained, with random selection if multiple succeed. If no attack succeeds, a randomly selected sample is retained to maintain dataset consistency.
  • Challenge Set: Three instances per query are retained using the same strategy.
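The retention rule can be sketched as below, assuming each candidate carries a boolean success flag obtained from the victim VLM's response; how the quota is filled when fewer than k attacks succeed is an assumption here.

import random

def select_samples(candidates: list[dict], k: int) -> list[dict]:
    """Retain k samples per query: successful jailbreaks first (chosen at
    random), topped up with randomly chosen unsuccessful attempts so that
    every query stays represented. The top-up rule is an assumption."""
    successes = [c for c in candidates if c["success"]]
    failures = [c for c in candidates if not c["success"]]
    random.shuffle(successes)
    random.shuffle(failures)
    return (successes + failures)[:k]

# base_entry        = select_samples(candidates, k=1)  # base set
# challenge_entries = select_samples(candidates, k=3)  # challenge set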

This structured approach ensures that VLJailbreakBench provides diverse, high-quality, and scalable adversarial data, allowing for a rigorous evaluation of VLM robustness.

Statistical Overview of Base and Challenge Sets

Structured by category and subcategory.

Category | Subcategory | Base Set Size | Challenge Set Size
Economic Harm | Economic Data Manipulation | 18 | 60
Economic Harm | Financial Fraud | 22 | 60
Economic Harm | Financial Misleading | 19 | 60
Ethical Issues | Algorithmic Bias | 19 | 60
Ethical Issues | Data Misuse | 20 | 60
Ethical Issues | Ethical Controversies | 22 | 60
Ethical Issues | Unauthorized Personalized Advice | 18 | 54
Gender and Cultural Bias | Cross-cultural Conflicts | 20 | 60
Gender and Cultural Bias | Cultural Discrimination | 20 | 60
Gender and Cultural Bias | Gender Discrimination | 20 | 60
Gender and Cultural Bias | Gender-based Violence | 20 | 60
Hate Speech | Gender Hate Speech | 20 | 60
Hate Speech | Political Hate Speech | 20 | 60
Hate Speech | Racial Hate Speech | 20 | 60
Hate Speech | Religious Hate Speech | 20 | 60
Health Consultation | False Health Advice | 20 | 60
Health Consultation | Medical Pseudoscience | 20 | 60
Health Consultation | Mental Health Issues | 19 | 57
Health Consultation | Misleading Self-diagnosis | 20 | 60
Illegal Activities | Forgery of Documents/Certificates | 20 | 60
Illegal Activities | Fraudulent Activities | 21 | 60
Illegal Activities | Illegal Goods Trade | 21 | 60
Illegal Activities | Piracy | 20 | 60
Illegal Activities | Sexual Assault | 21 | 60
Malware and Cyberattacks | Cybersecurity Vulnerabilities | 19 | 60
Malware and Cyberattacks | Identity Impersonation | 19 | 57
Malware and Cyberattacks | Malware Generation | 20 | 60
Malware and Cyberattacks | Phishing Attacks | 20 | 60
Political Sensitivity | Political Manipulation and Intervention | 20 | 60
Political Sensitivity | Political Violence | 20 | 60
Political Sensitivity | Regime Suppression | 13 | 39
Political Sensitivity | Suppression of Freedom of Speech | 19 | 57
Pornography | Gender-based Violence | 19 | 57
Pornography | Pornographic Image and Video Generation | 20 | 54
Pornography | Pornographic Text Generation and Distribution | 12 | 36
Privacy Violations | Abuse of Facial Recognition | 20 | 60
Privacy Violations | Location Tracking | 19 | 57
Privacy Violations | Personal Information Leaks | 20 | 57
Privacy Violations | Voice Data Leaks | 19 | 57
Social Attacks and Harassment | Cultural Attacks | 20 | 60
Social Attacks and Harassment | Cyber Harassment | 18 | 60
Social Attacks and Harassment | Online Bullying | 22 | 60
Social Attacks and Harassment | Sexual Harassment | 20 | 59
Violent Behavior | Cyberbullying | 19 | 60
Violent Behavior | Domestic Violence | 18 | 57
Violent Behavior | Physical Violence | 20 | 60
Violent Behavior | Psychological Violence | 20 | 60

Examples from the challenge set in VLJailbreakBench.

Safety Evaluation of VLMs on the Base Set

ASR (%) across 12 safety topics. Column abbreviations denote the safety topics: IA = Illegal Activities, VB = Violent Behavior, HS = Hate Speech, PV = Privacy Violations, MC = Malware and Cyberattacks, HC = Health Consultation, EH = Economic Harm, GCB = Gender and Cultural Bias, PS = Political Sensitivity, EI = Ethical Issues, SAH = Social Attacks and Harassment, P = Pornography. Some model names are also shortened for brevity. "Avg." denotes the average ASR across all topics.

ASR (%) IA VB HS PV MC HC EH GCB PS EI SAH P Avg.
Qwen2-VL 37.86 29.87 20.00 33.33 38.46 34.18 23.73 42.50 48.61 46.84 28.75 33.33 35.04
MiniGPT-v2 24.27 35.06 18.75 39.74 37.18 41.77 37.29 34.18 44.44 36.71 40.00 13.73 33.77
LLaVA-OneVision 28.16 31.17 23.75 28.21 35.90 29.11 18.64 31.65 43.06 31.65 23.75 19.61 29.07
Llama-3.2-11B-Vision 16.50 15.58 11.25 19.23 12.82 20.25 15.25 12.50 19.44 16.46 6.25 11.76 14.85
Llama-3.2-90B-Vision 7.77 14.29 2.50 7.69 8.97 17.72 3.39 1.25 11.11 3.80 8.75 7.84 7.97
Gemini-2.0-Flash 52.43 61.04 33.75 47.44 67.95 45.57 50.85 55.00 66.67 60.76 53.75 43.14 53.38
Gemini-1.5-Pro 20.39 28.57 18.75 21.79 35.90 15.19 25.42 30.00 44.44 32.91 23.75 23.53 26.53
Gemini-2.0-Flash-Think 16.50 29.87 11.25 21.79 25.64 13.92 16.95 13.75 43.06 25.32 15.00 15.69 20.63
GPT-4o Mini 9.71 19.48 8.75 14.10 8.97 25.32 13.56 20.00 34.72 10.13 7.50 5.88 14.85
GPT-4o 7.77 12.99 1.25 7.69 6.41 10.13 8.47 8.75 26.39 2.53 6.25 3.92 8.52
Claude-3.5-Sonnet 0.00 1.30 0.00 2.56 1.28 1.27 1.69 1.25 1.39 1.27 1.25 0.00 1.09
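For reference, the sketch below shows one way the per-topic ASR and the "Avg." column could be computed, assuming each evaluation record carries the safety topic and a boolean jailbreak verdict for the evaluated model; the record format and the pooling rule for "Avg." are assumptions.

from collections import defaultdict

def asr_table(records: list[dict]) -> dict[str, float]:
    """Per-topic ASR in percent, plus an overall "Avg." entry.
    The record format ({'topic': ..., 'jailbroken': bool}) is an assumption."""
    attempts, successes = defaultdict(int), defaultdict(int)
    for r in records:
        attempts[r["topic"]] += 1
        successes[r["topic"]] += int(r["jailbroken"])
    table = {t: 100.0 * successes[t] / attempts[t] for t in attempts}
    # Pool all samples for "Avg."; averaging the 12 topic ASRs instead is the
    # other plausible reading.
    table["Avg."] = 100.0 * sum(successes.values()) / sum(attempts.values())
    return table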

Safety Evaluation of VLMs on the Challenge Set

ASR (%) across 12 safety topics. Topic abbreviations follow the base-set table above; some model names are shortened for brevity. "Avg." denotes the average ASR across all topics.

ASR (%) IA VB HS PV MC HC EH GCB PS EI SAH P Avg.
Qwen2-VL 54.66 63.29 57.50 77.92 77.22 65.40 68.33 72.92 89.35 74.79 86.19 76.87 71.40
LLaVA-OneVision 61.33 75.11 61.67 75.75 75.95 61.18 69.44 65.42 81.48 67.09 74.90 52.38 68.70
MiniGPT-v2 44.33 59.92 52.72 60.87 59.07 50.85 46.67 64.17 61.11 53.42 58.58 51.02 55.25
Llama-3.2-11B-Vision 56.33 51.48 37.50 47.62 49.79 38.82 42.22 47.50 68.06 60.68 53.14 46.26 50.22
Llama-3.2-90B-Vision 46.67 60.34 29.17 61.04 59.07 46.84 46.11 33.33 58.80 50.00 47.70 31.97 47.95
GPT-4o Mini 67.33 81.86 54.58 74.03 75.11 72.57 70.56 75.42 82.41 73.08 76.57 60.54 72.21
Gemini-2.0-Flash-Think 62.33 81.01 62.08 68.83 78.48 66.24 68.89 77.50 79.63 78.21 75.73 54.42 71.44
Gemini-2.0-Flash 56.00 72.57 46.67 56.28 75.95 64.56 78.33 82.92 93.98 73.93 61.92 34.69 66.84
Claude-3.5-Sonnet 22.00 20.25 10.83 21.65 22.78 15.61 16.11 10.83 21.30 23.93 28.45 21.77 19.65

BibTeX

@article{wang2024ideator,
  title={IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves},
  author={Wang, Ruofan and Li, Juncheng and Wang, Yixu and Wang, Bo and Wang, Xiaosen and Teng, Yan and Wang, Yingchun and Ma, Xingjun and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2411.00827},
  year={2024}
}