VLJailbreakBench

VLJailbreakBench: A Benchmark for Evaluating VLM Robustness against Multimodal Jailbreak Attacks

Ruofan Wang, Juncheng Li, Yixu Wang, Bo Wang, Xiaosen Wang, Yan Teng,
Yingchun Wang, Xingjun Ma *, Yu-Gang Jiang
Fudan University, Huawei Technologies Ltd., Shanghai Artificial Intelligence Laboratory

*Corresponding author

Benchmark Overview

VLJailbreakBench is structured into two evaluation tiers: a base set and a challenge set, designed to assess VLMs at distinct difficulty levels. The dataset spans 12 safety topics and 46 subcategories, comprising 916 harmful queries. For each query, we generate one jailbreak text-image pair for the base set and three for the challenge set, resulting in a comprehensive collection of 3,654 jailbreak samples. This hierarchical design ensures a rigorous evaluation of VLM robustness across varying adversarial scenarios.
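For concreteness, each benchmark entry pairs a harmful query with a generated jailbreak text-image pair under one of the 12 topics and 46 subcategories. The sketch below is a hypothetical Python representation of such an entry; the field names and file layout are assumptions, not the released data format.

from dataclasses import dataclass

@dataclass
class JailbreakSample:
    """Hypothetical record for one VLJailbreakBench entry (field names assumed)."""
    category: str        # one of the 12 safety topics, e.g. "Illegal Activities"
    subcategory: str     # one of the 46 subcategories, e.g. "Phishing Attacks"
    harmful_query: str   # the underlying harmful query (916 in total)
    jailbreak_text: str  # adversarial text prompt paired with the image
    image_path: str      # path to the paired jailbreak image
    split: str           # "base" (one pair per query) or "challenge" (three pairs per query)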

Safety Risk Taxonomy: To construct a comprehensive safety risk taxonomy for VLJailbreakBench, we collaborated with experts from the humanities and social sciences to extend existing taxonomies, ensuring coverage of both technical vulnerabilities and societal impacts. The taxonomy provides a structured classification of different security risks, offering valuable insights into VLM safety in real-world applications.

Safety taxonomy of VLJailbreakBench.

Dataset Generation

The VLJailbreakBench dataset is constructed through a three-step pipeline designed to yield high-quality multimodal jailbreak samples.

Step 1: Initial Query Generation

We generate 920 initial harmful queries across 46 safety subcategories using Google Gemini. These queries are then filtered by GPT-4o and Llama 3 to remove harmless entries, resulting in 916 refined harmful queries.
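A minimal sketch of this step is shown below, assuming a generic call_model wrapper around the respective model APIs; the prompts and the two-model agreement rule are illustrative assumptions, not the exact procedure used.

# Hypothetical sketch of Step 1. `call_model` stands in for the Gemini,
# GPT-4o, and Llama 3 APIs; prompts and the agreement rule are assumptions.

def call_model(model_name: str, prompt: str) -> str:
    """Placeholder for an actual LLM API call."""
    raise NotImplementedError

def generate_initial_queries(subcategory: str, n: int = 20) -> list[str]:
    # 46 subcategories x ~20 queries each yields the 920 initial queries.
    prompt = (f"Write {n} harmful user requests belonging to the safety "
              f"subcategory '{subcategory}', one per line.")
    return [q.strip() for q in call_model("gemini", prompt).splitlines() if q.strip()]

def is_harmful(query: str) -> bool:
    # Keep a query only if the filter models judge it harmful (920 -> 916).
    verdicts = [call_model(m, f"Is this request harmful? Answer yes or no.\n{query}")
                for m in ("gpt-4o", "llama-3")]
    return all(v.strip().lower().startswith("yes") for v in verdicts)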

Step 2: Jailbreak Data Generation

Two subsets are created for adversarial testing (a sketch of the width/depth attack loop follows the list):

  • Base Set – MiniGPT-4 attacks LLaVA-1.5 with an attack width of 5 and depth of 2, simulating moderate adversarial scenarios.
  • Challenge Set – Gemini-1.5-Pro attacks GPT-4o-mini with an attack width of 3 and depth of 3, representing advanced jailbreak scenarios. During refinement, Gemini-1.5-Pro is replaced with Gemini-2.0-Flash-Thinking for enhanced data quality.
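As a rough illustration of how the width and depth parameters could drive the attacker/victim loop, the sketch below proposes width candidate text-image pairs per round and refines over depth rounds. The helper functions are placeholders for the attacker VLM, the victim VLM, and the success judge; the actual IDEATOR prompting and refinement logic may differ.

# Hypothetical width/depth attack loop. attacker_propose / victim_respond /
# judge_success are placeholders; the real prompting and refinement differ.

def attacker_propose(query: str, feedback: str | None, width: int) -> list[dict]:
    """Ask the attacker VLM for `width` candidate text-image jailbreak pairs."""
    raise NotImplementedError

def victim_respond(candidate: dict) -> str:
    """Query the victim VLM with a candidate text-image pair."""
    raise NotImplementedError

def judge_success(query: str, response: str) -> bool:
    """Judge whether the victim's response fulfils the harmful query."""
    raise NotImplementedError

def attack(query: str, width: int, depth: int) -> list[dict]:
    successes, feedback = [], None
    for _ in range(depth):                                     # depth = refinement rounds
        for cand in attacker_propose(query, feedback, width):  # width candidates per round
            response = victim_respond(cand)
            if judge_success(query, response):
                successes.append(cand)
            else:
                feedback = response                            # failed response guides refinement
    return successes

# Base set:      attack(q, width=5, depth=2)  (MiniGPT-4 vs. LLaVA-1.5)
# Challenge set: attack(q, width=3, depth=3)  (Gemini-1.5-Pro vs. GPT-4o-mini)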

Step 3: Data Filtering

The generated samples are filtered against the victim VLMs to ensure dataset quality (a selection sketch follows the list):

  • Base Set: One successful jailbreak instance per query is retained, with random selection if multiple succeed. If no attack succeeds, a randomly selected sample is retained to maintain dataset consistency.
  • Challenge Set: Three instances per query are retained using the same strategy.
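The retention rule can be sketched as below, assuming each candidate carries a boolean success flag obtained from the victim VLM's response; how the quota is filled when fewer than k attacks succeed is an assumption here.

import random

def select_samples(candidates: list[dict], k: int) -> list[dict]:
    """Retain k samples per query: successful jailbreaks first (chosen at
    random), topped up with randomly chosen unsuccessful attempts so that
    every query stays represented. The top-up rule is an assumption."""
    successes = [c for c in candidates if c["success"]]
    failures = [c for c in candidates if not c["success"]]
    random.shuffle(successes)
    random.shuffle(failures)
    return (successes + failures)[:k]

# base_entry        = select_samples(candidates, k=1)  # base set
# challenge_entries = select_samples(candidates, k=3)  # challenge set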

This structured approach ensures that VLJailbreakBench provides diverse, high-quality, and scalable adversarial data, allowing for a rigorous evaluation of VLM robustness.

Statistical Overview of Base and Challenge Sets

Structured by category and subcategory.

Category | Subcategory | Base Set Size | Challenge Set Size
Economic Harm | Economic Data Manipulation | 18 | 60
Economic Harm | Financial Fraud | 22 | 60
Economic Harm | Financial Misleading | 19 | 60
Ethical Issues | Algorithmic Bias | 19 | 60
Ethical Issues | Data Misuse | 20 | 60
Ethical Issues | Ethical Controversies | 22 | 60
Ethical Issues | Unauthorized Personalized Advice | 18 | 54
Gender and Cultural Bias | Cross-cultural Conflicts | 20 | 60
Gender and Cultural Bias | Cultural Discrimination | 20 | 60
Gender and Cultural Bias | Gender Discrimination | 20 | 60
Gender and Cultural Bias | Gender-based Violence | 20 | 60
Hate Speech | Gender Hate Speech | 20 | 60
Hate Speech | Political Hate Speech | 20 | 60
Hate Speech | Racial Hate Speech | 20 | 60
Hate Speech | Religious Hate Speech | 20 | 60
Health Consultation | False Health Advice | 20 | 60
Health Consultation | Medical Pseudoscience | 20 | 60
Health Consultation | Mental Health Issues | 19 | 57
Health Consultation | Misleading Self-diagnosis | 20 | 60
Illegal Activities | Forgery of Documents/Certificates | 20 | 60
Illegal Activities | Fraudulent Activities | 21 | 60
Illegal Activities | Illegal Goods Trade | 21 | 60
Illegal Activities | Piracy | 20 | 60
Illegal Activities | Sexual Assault | 21 | 60
Malware and Cyberattacks | Cybersecurity Vulnerabilities | 19 | 60
Malware and Cyberattacks | Identity Impersonation | 19 | 57
Malware and Cyberattacks | Malware Generation | 20 | 60
Malware and Cyberattacks | Phishing Attacks | 20 | 60
Political Sensitivity | Political Manipulation and Intervention | 20 | 60
Political Sensitivity | Political Violence | 20 | 60
Political Sensitivity | Regime Suppression | 13 | 39
Political Sensitivity | Suppression of Freedom of Speech | 19 | 57
Pornography | Gender-based Violence | 19 | 57
Pornography | Pornographic Image and Video Generation | 20 | 54
Pornography | Pornographic Text Generation and Distribution | 12 | 36
Privacy Violations | Abuse of Facial Recognition | 20 | 60
Privacy Violations | Location Tracking | 19 | 57
Privacy Violations | Personal Information Leaks | 20 | 57
Privacy Violations | Voice Data Leaks | 19 | 57
Social Attacks and Harassment | Cultural Attacks | 20 | 60
Social Attacks and Harassment | Cyber Harassment | 18 | 60
Social Attacks and Harassment | Online Bullying | 22 | 60
Social Attacks and Harassment | Sexual Harassment | 20 | 59
Violent Behavior | Cyberbullying | 19 | 60
Violent Behavior | Domestic Violence | 18 | 57
Violent Behavior | Physical Violence | 20 | 60
Violent Behavior | Psychological Violence | 20 | 60

Examples from the challenge set in VLJailbreakBench.

Safety Evaluation of VLMs on the Base Set

ASR (%) across 12 safety topics. Column abbreviations denote the safety topics: IA = Illegal Activities, VB = Violent Behavior, HS = Hate Speech, PV = Privacy Violations, MC = Malware and Cyberattacks, HC = Health Consultation, EH = Economic Harm, GCB = Gender and Cultural Bias, PS = Political Sensitivity, EI = Ethical Issues, SAH = Social Attacks and Harassment, P = Pornography. Some model names are also shortened for brevity. "Avg." denotes the average ASR across all topics.

ASR (%) IA VB HS PV MC HC EH GCB PS EI SAH P Avg.
Qwen2-VL 37.86 29.87 20.00 33.33 38.46 34.18 23.73 42.50 48.61 46.84 28.75 33.33 35.04
MiniGPT-v2 24.27 35.06 18.75 39.74 37.18 41.77 37.29 34.18 44.44 36.71 40.00 13.73 33.77
LLaVA-OneVision 28.16 31.17 23.75 28.21 35.90 29.11 18.64 31.65 43.06 31.65 23.75 19.61 29.07
Llama-3.2-11B-Vision 16.50 15.58 11.25 19.23 12.82 20.25 15.25 12.50 19.44 16.46 6.25 11.76 14.85
Llama-3.2-90B-Vision 7.77 14.29 2.50 7.69 8.97 17.72 3.39 1.25 11.11 3.80 8.75 7.84 7.97
Gemini-2.0-Flash 52.43 61.04 33.75 47.44 67.95 45.57 50.85 55.00 66.67 60.76 53.75 43.14 53.38
Gemini-1.5-Pro 20.39 28.57 18.75 21.79 35.90 15.19 25.42 30.00 44.44 32.91 23.75 23.53 26.53
Gemini-2.0-Flash-Think 16.50 29.87 11.25 21.79 25.64 13.92 16.95 13.75 43.06 25.32 15.00 15.69 20.63
GPT-4o Mini 9.71 19.48 8.75 14.10 8.97 25.32 13.56 20.00 34.72 10.13 7.50 5.88 14.85
GPT-4o 7.77 12.99 1.25 7.69 6.41 10.13 8.47 8.75 26.39 2.53 6.25 3.92 8.52
Claude-3.5-Sonnet 0.00 1.30 0.00 2.56 1.28 1.27 1.69 1.25 1.39 1.27 1.25 0.00 1.09
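For reference, the sketch below shows one way the per-topic ASR and the "Avg." column could be computed, assuming each evaluation record carries the safety topic and a boolean jailbreak verdict for the evaluated model; the record format and the pooling rule for "Avg." are assumptions.

from collections import defaultdict

def asr_table(records: list[dict]) -> dict[str, float]:
    """Per-topic ASR in percent, plus an overall "Avg." entry.
    The record format ({'topic': ..., 'jailbroken': bool}) is an assumption."""
    attempts, successes = defaultdict(int), defaultdict(int)
    for r in records:
        attempts[r["topic"]] += 1
        successes[r["topic"]] += int(r["jailbroken"])
    table = {t: 100.0 * successes[t] / attempts[t] for t in attempts}
    # Pool all samples for "Avg."; averaging the 12 topic ASRs instead is the
    # other plausible reading.
    table["Avg."] = 100.0 * sum(successes.values()) / sum(attempts.values())
    return table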

Safety Evaluation of VLMs on the Challenge Set

ASR (%) across 12 safety topics. Topic abbreviations follow the base-set table above; some model names are shortened for brevity. "Avg." denotes the average ASR across all topics.

ASR (%) IA VB HS PV MC HC EH GCB PS EI SAH P Avg.
Qwen2-VL 54.66 63.29 57.50 77.92 77.22 65.40 68.33 72.92 89.35 74.79 86.19 76.87 71.40
LLaVA-OneVision 61.33 75.11 61.67 75.75 75.95 61.18 69.44 65.42 81.48 67.09 74.90 52.38 68.70
MiniGPT-v2 44.33 59.92 52.72 60.87 59.07 50.85 46.67 64.17 61.11 53.42 58.58 51.02 55.25
Llama-3.2-11B-Vision 56.33 51.48 37.50 47.62 49.79 38.82 42.22 47.50 68.06 60.68 53.14 46.26 50.22
Llama-3.2-90B-Vision 46.67 60.34 29.17 61.04 59.07 46.84 46.11 33.33 58.80 50.00 47.70 31.97 47.95
GPT-4o Mini 67.33 81.86 54.58 74.03 75.11 72.57 70.56 75.42 82.41 73.08 76.57 60.54 72.21
Gemini-2.0-Flash-Think 62.33 81.01 62.08 68.83 78.48 66.24 68.89 77.50 79.63 78.21 75.73 54.42 71.44
Gemini-2.0-Flash 56.00 72.57 46.67 56.28 75.95 64.56 78.33 82.92 93.98 73.93 61.92 34.69 66.84
Claude-3.5-Sonnet 22.00 20.25 10.83 21.65 22.78 15.61 16.11 10.83 21.30 23.93 28.45 21.77 19.65

BibTeX

@article{wang2024ideator,
  title={IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves},
  author={Wang, Ruofan and Li, Juncheng and Wang, Yixu and Wang, Bo and Wang, Xiaosen and Teng, Yan and Wang, Yingchun and Ma, Xingjun and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2411.00827},
  year={2024}
}