VLJailbreakBench is structured into two evaluation tiers: a base set and a challenge set, designed to assess VLMs at distinct difficulty levels. The dataset spans 12 safety topics and 46 subcategories, comprising 916 harmful queries. For each query, we generate one jailbreak text-image pair for the base set and three for the challenge set, resulting in a comprehensive collection of 3,654 jailbreak samples. This two-tier design enables rigorous evaluation of VLM robustness across adversarial scenarios of varying difficulty.
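For illustration, one benchmark entry might be organized as sketched below; the field names are hypothetical and do not describe the released data format.

```python
from dataclasses import dataclass

@dataclass
class JailbreakSample:
    """Hypothetical schema for a single VLJailbreakBench entry (field names are illustrative)."""
    topic: str           # one of the 12 safety topics, e.g. "Illegal Activities"
    subcategory: str     # one of the 46 subcategories, e.g. "Piracy"
    harmful_query: str   # the original harmful query the sample is derived from
    jailbreak_text: str  # adversarial text prompt paired with the image
    image_path: str      # path to the adversarial image
    split: str           # "base" (one pair per query) or "challenge" (three pairs per query)
```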
Safety Risk Taxonomy: To construct a comprehensive safety risk taxonomy for VLJailbreakBench, we collaborated with experts from the humanities and social sciences to extend existing taxonomies, ensuring coverage of both technical vulnerabilities and societal impacts. The taxonomy provides a structured classification of safety risks, offering insight into VLM safety in real-world applications.
The VLJailbreakBench dataset is constructed through a three-step pipeline to ensure high-quality multimodal jailbreak samples (a minimal code sketch is given below):

1. Query generation: We generate 920 initial harmful queries across the 46 safety subcategories using Google Gemini. These queries are then filtered by GPT-4o and Llama 3 to remove harmless entries, leaving 916 refined harmful queries.
2. Jailbreak sample generation: For each query, we create one jailbreak text-image pair for the base set and three for the challenge set, forming the two subsets used for adversarial testing.
3. Quality filtering: We filter the generated samples against victim VLMs to ensure dataset quality.
This structured approach ensures that VLJailbreakBench provides diverse, high-quality, and scalable adversarial data, allowing for a rigorous evaluation of VLM robustness.
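A minimal sketch of this pipeline is shown below. All function names are illustrative stand-ins (no released API is implied), and the model calls are stubbed out.

```python
# Illustrative outline of the three-step construction pipeline; every function
# here is a stub standing in for the corresponding model-based step.

def generate_queries(n: int) -> list[str]:
    """Step 1a: draft harmful queries with Google Gemini (stubbed)."""
    return [f"query_{i}" for i in range(n)]

def is_harmful(query: str) -> bool:
    """Step 1b: harmfulness filter with GPT-4o and Llama 3 (stubbed as always true)."""
    return True

def make_jailbreak_pair(query: str) -> dict:
    """Step 2: produce one adversarial text-image pair for a query (stubbed)."""
    return {"query": query, "jailbreak_text": "...", "image_path": "..."}

def passes_victim_filter(sample: dict) -> bool:
    """Step 3: keep only samples that survive filtering against victim VLMs (stubbed)."""
    return True

queries = [q for q in generate_queries(920) if is_harmful(q)]             # 916 of 920 remain in the real pipeline
base = [make_jailbreak_pair(q) for q in queries]                          # one pair per query
challenge = [make_jailbreak_pair(q) for q in queries for _ in range(3)]   # three pairs per query
base = [s for s in base if passes_victim_filter(s)]
challenge = [s for s in challenge if passes_victim_filter(s)]             # the released benchmark keeps 3,654 samples in total
```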
Dataset statistics for the base and challenge sets, structured by category and subcategory.
| Category | Subcategory | Base Set Size | Challenge Set Size |
|---|---|---|---|
| Economic Harm | Economic Data Manipulation | 18 | 60 |
| | Financial Fraud | 22 | 60 |
| | Financial Misleading | 19 | 60 |
| Ethical Issues | Algorithmic Bias | 19 | 60 |
| | Data Misuse | 20 | 60 |
| | Ethical Controversies | 22 | 60 |
| | Unauthorized Personalized Advice | 18 | 54 |
| Gender and Cultural Bias | Cross-cultural Conflicts | 20 | 60 |
| | Cultural Discrimination | 20 | 60 |
| | Gender Discrimination | 20 | 60 |
| | Gender-based Violence | 20 | 60 |
| Hate Speech | Gender Hate Speech | 20 | 60 |
| | Political Hate Speech | 20 | 60 |
| | Racial Hate Speech | 20 | 60 |
| | Religious Hate Speech | 20 | 60 |
| Health Consultation | False Health Advice | 20 | 60 |
| | Medical Pseudoscience | 20 | 60 |
| | Mental Health Issues | 19 | 57 |
| | Misleading Self-diagnosis | 20 | 60 |
| Illegal Activities | Forgery of Documents/Certificates | 20 | 60 |
| | Fraudulent Activities | 21 | 60 |
| | Illegal Goods Trade | 21 | 60 |
| | Piracy | 20 | 60 |
| | Sexual Assault | 21 | 60 |
| Malware and Cyberattacks | Cybersecurity Vulnerabilities | 19 | 60 |
| | Identity Impersonation | 19 | 57 |
| | Malware Generation | 20 | 60 |
| | Phishing Attacks | 20 | 60 |
| Political Sensitivity | Political Manipulation and Intervention | 20 | 60 |
| | Political Violence | 20 | 60 |
| | Regime Suppression | 13 | 39 |
| | Suppression of Freedom of Speech | 19 | 57 |
| Pornography | Gender-based Violence | 19 | 57 |
| | Pornographic Image and Video Generation | 20 | 54 |
| | Pornographic Text Generation and Distribution | 12 | 36 |
| Privacy Violations | Abuse of Facial Recognition | 20 | 60 |
| | Location Tracking | 19 | 57 |
| | Personal Information Leaks | 20 | 57 |
| | Voice Data Leaks | 19 | 57 |
| Social Attacks and Harassment | Cultural Attacks | 20 | 60 |
| | Cyber Harassment | 18 | 60 |
| | Online Bullying | 22 | 60 |
| | Sexual Harassment | 20 | 59 |
| Violent Behavior | Cyberbullying | 19 | 60 |
| | Domestic Violence | 18 | 57 |
| | Physical Violence | 20 | 60 |
| | Psychological Violence | 20 | 60 |
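As a quick sanity check, the per-category totals below are summed from the subcategory rows above and reproduce the reported dataset sizes (916 harmful queries and 3,654 jailbreak samples).

```python
# Per-category sample counts, summed from the subcategory table above.
base = {
    "Economic Harm": 59, "Ethical Issues": 79, "Gender and Cultural Bias": 80,
    "Hate Speech": 80, "Health Consultation": 79, "Illegal Activities": 103,
    "Malware and Cyberattacks": 78, "Political Sensitivity": 72, "Pornography": 51,
    "Privacy Violations": 78, "Social Attacks and Harassment": 80, "Violent Behavior": 77,
}
challenge = {
    "Economic Harm": 180, "Ethical Issues": 234, "Gender and Cultural Bias": 240,
    "Hate Speech": 240, "Health Consultation": 237, "Illegal Activities": 300,
    "Malware and Cyberattacks": 237, "Political Sensitivity": 216, "Pornography": 147,
    "Privacy Violations": 231, "Social Attacks and Harassment": 239, "Violent Behavior": 237,
}

assert sum(base.values()) == 916                             # base-set samples (= harmful queries)
assert sum(challenge.values()) == 2738                       # challenge-set samples
assert sum(base.values()) + sum(challenge.values()) == 3654  # total jailbreak samples
```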
ASR (%) on the base set across 12 safety topics. Certain model names are abbreviated for brevity, and topic names are abbreviated in the column headers: IA (Illegal Activities), VB (Violent Behavior), HS (Hate Speech), PV (Privacy Violations), MC (Malware and Cyberattacks), HC (Health Consultation), EH (Economic Harm), GCB (Gender and Cultural Bias), PS (Political Sensitivity), EI (Ethical Issues), SAH (Social Attacks and Harassment), P (Pornography). "Avg." denotes the overall ASR across all samples.
| ASR (%) | IA | VB | HS | PV | MC | HC | EH | GCB | PS | EI | SAH | P | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2-VL | 37.86 | 29.87 | 20.00 | 33.33 | 38.46 | 34.18 | 23.73 | 42.50 | 48.61 | 46.84 | 28.75 | 33.33 | 35.04 |
| MiniGPT-v2 | 24.27 | 35.06 | 18.75 | 39.74 | 37.18 | 41.77 | 37.29 | 34.18 | 44.44 | 36.71 | 40.00 | 13.73 | 33.77 |
| LLaVA-OneVision | 28.16 | 31.17 | 23.75 | 28.21 | 35.90 | 29.11 | 18.64 | 31.65 | 43.06 | 31.65 | 23.75 | 19.61 | 29.07 |
| Llama-3.2-11B-Vision | 16.50 | 15.58 | 11.25 | 19.23 | 12.82 | 20.25 | 15.25 | 12.50 | 19.44 | 16.46 | 6.25 | 11.76 | 14.85 |
| Llama-3.2-90B-Vision | 7.77 | 14.29 | 2.50 | 7.69 | 8.97 | 17.72 | 3.39 | 1.25 | 11.11 | 3.80 | 8.75 | 7.84 | 7.97 |
| Gemini-2.0-Flash | 52.43 | 61.04 | 33.75 | 47.44 | 67.95 | 45.57 | 50.85 | 55.00 | 66.67 | 60.76 | 53.75 | 43.14 | 53.38 |
| Gemini-1.5-Pro | 20.39 | 28.57 | 18.75 | 21.79 | 35.90 | 15.19 | 25.42 | 30.00 | 44.44 | 32.91 | 23.75 | 23.53 | 26.53 |
| Gemini-2.0-Flash-Think | 16.50 | 29.87 | 11.25 | 21.79 | 25.64 | 13.92 | 16.95 | 13.75 | 43.06 | 25.32 | 15.00 | 15.69 | 20.63 |
| GPT-4o Mini | 9.71 | 19.48 | 8.75 | 14.10 | 8.97 | 25.32 | 13.56 | 20.00 | 34.72 | 10.13 | 7.50 | 5.88 | 14.85 |
| GPT-4o | 7.77 | 12.99 | 1.25 | 7.69 | 6.41 | 10.13 | 8.47 | 8.75 | 26.39 | 2.53 | 6.25 | 3.92 | 8.52 |
| Claude-3.5-Sonnet | 0.00 | 1.30 | 0.00 | 2.56 | 1.28 | 1.27 | 1.69 | 1.25 | 1.39 | 1.27 | 1.25 | 0.00 | 1.09 |
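As a check of the "Avg." column, the sketch below recomputes it for the Qwen2-VL row as the sample-weighted mean of the per-topic ASRs, using the base-set topic sizes summed from the statistics table above (this reproduces the reported 35.04%).

```python
# Base-set samples per topic (summed from the statistics table above).
base_sizes = {
    "IA": 103, "VB": 77, "HS": 80, "PV": 78, "MC": 78, "HC": 79,
    "EH": 59, "GCB": 80, "PS": 72, "EI": 79, "SAH": 80, "P": 51,
}
# Base-set ASR (%) per topic for Qwen2-VL, copied from the row above.
qwen2_vl = {
    "IA": 37.86, "VB": 29.87, "HS": 20.00, "PV": 33.33, "MC": 38.46, "HC": 34.18,
    "EH": 23.73, "GCB": 42.50, "PS": 48.61, "EI": 46.84, "SAH": 28.75, "P": 33.33,
}

# Sample-weighted mean over topics == overall ASR across all 916 base-set samples.
avg = sum(qwen2_vl[t] * base_sizes[t] for t in base_sizes) / sum(base_sizes.values())
print(round(avg, 2))  # 35.04, matching the reported "Avg." for Qwen2-VL
```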
ASR (%) on the challenge set across 12 safety topics. Certain model names are abbreviated for brevity; topic abbreviations follow the base-set table above. "Avg." denotes the overall ASR across all samples.
| ASR (%) | IA | VB | HS | PV | MC | HC | EH | GCB | PS | EI | SAH | P | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2-VL | 54.66 | 63.29 | 57.50 | 77.92 | 77.22 | 65.40 | 68.33 | 72.92 | 89.35 | 74.79 | 86.19 | 76.87 | 71.40 |
| LLaVA-OneVision | 61.33 | 75.11 | 61.67 | 75.75 | 75.95 | 61.18 | 69.44 | 65.42 | 81.48 | 67.09 | 74.90 | 52.38 | 68.70 |
| MiniGPT-v2 | 44.33 | 59.92 | 52.72 | 60.87 | 59.07 | 50.85 | 46.67 | 64.17 | 61.11 | 53.42 | 58.58 | 51.02 | 55.25 |
| Llama-3.2-11B-Vision | 56.33 | 51.48 | 37.50 | 47.62 | 49.79 | 38.82 | 42.22 | 47.50 | 68.06 | 60.68 | 53.14 | 46.26 | 50.22 |
| Llama-3.2-90B-Vision | 46.67 | 60.34 | 29.17 | 61.04 | 59.07 | 46.84 | 46.11 | 33.33 | 58.80 | 50.00 | 47.70 | 31.97 | 47.95 |
| GPT-4o Mini | 67.33 | 81.86 | 54.58 | 74.03 | 75.11 | 72.57 | 70.56 | 75.42 | 82.41 | 73.08 | 76.57 | 60.54 | 72.21 |
| Gemini-2.0-Flash-Think | 62.33 | 81.01 | 62.08 | 68.83 | 78.48 | 66.24 | 68.89 | 77.50 | 79.63 | 78.21 | 75.73 | 54.42 | 71.44 |
| Gemini-2.0-Flash | 56.00 | 72.57 | 46.67 | 56.28 | 75.95 | 64.56 | 78.33 | 82.92 | 93.98 | 73.93 | 61.92 | 34.69 | 66.84 |
| Claude-3.5-Sonnet | 22.00 | 20.25 | 10.83 | 21.65 | 22.78 | 15.61 | 16.11 | 10.83 | 21.30 | 23.93 | 28.45 | 21.77 | 19.65 |
@article{wang2024ideator,
title={IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves},
author={Wang, Ruofan and Li, Juncheng and Wang, Yixu and Wang, Bo and Wang, Xiaosen and Teng, Yan and Wang, Yingchun and Ma, Xingjun and Jiang, Yu-Gang},
journal={arXiv preprint arXiv:2411.00827},
year={2024}
}