CodeAttack is one of the strongest jailbreak methods for LLMs to date.
- An enhanced version of CodeAttack, highly effective against the latest GPT-4 and Claude-3 series models, will be available next week!
For more details, please refer to our ACL 2024 paper.
✨ An example run:

```bash
codeattack --num-sample 1 \
    --prompt-type python_stack_plus \
    --target-model=gpt-4o \
    --judge-model=gpt-4o \
    --exp-name=main \
    --target-max-n-tokens=1000 \
    --multi-thread \
    --temperature 0 \
    --start-idx 0 --end-idx -1
```
There are two data folders in the `codeattack` package:
- The `dataset` folder contains the dataset `harmful_behaviors.csv` from AdvBench. This is the default dataset used by `codeattack`.
- The `prompt_templates` folder provides templates for our CodeAttack in Python, C++, and Go. We recommend using a prompt template with "plus" to get a more detailed model response.

A third folder, `prompts`, sits in the repository root and contains the adversarial prompts generated by our CodeAttack.
For convenience, we include three versions of CodeAttack prompts curated using AdvBench: `data_python_string_full.json`, `data_python_list_full.json`, and `data_python_stack_full.json`.
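If you want to work with these files directly, they can be read with the standard `json` module. The snippet below is only a minimal sketch: we assume each file holds a JSON array of prompt records, so inspect an actual record's fields before relying on them.

```python
import json

# Path assumes you run this from the repository root;
# data_python_stack_full.json is one of the three files named above.
with open("prompts/data_python_stack_full.json") as f:
    records = json.load(f)

# We assume a top-level list here; if the file is keyed differently,
# adapt the indexing accordingly.
print(f"loaded {len(records)} records")
print(records[0])  # inspect one record to see its actual fields
```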
CodeAttack constructs a code template in three steps:
1. **Input encoding**, which encodes the harmful text-based query with common data structures;
2. **Task understanding**, which applies a `decode()` function so that LLMs extract the target task from various kinds of inputs;
3. **Output specification**, which enables LLMs to fill the output structure with the user's desired content.
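The actual templates live in `prompt_templates`; the sketch below only illustrates the three-step idea for a Python stack encoding. All names here (`encode_as_stack`, `build_prompt`, the template body) are our own illustrative choices, not the shipped template.

```python
# (1) Input encoding: the query is embedded only as a stack of words,
# never as plain text, using a common data structure.
def encode_as_stack(query: str) -> list:
    stack = []
    for word in query.split():
        stack.append(word)
    return stack

# The code template handed to the model. Steps (2) and (3) appear as
# code the model is asked to reason about and complete.
TEMPLATE = '''\
# (2) Task understanding: pop the stack and reconstruct the original
# task from the encoded input.
def decode(stack):
    words = []
    while stack:
        words.append(stack.pop())
    return " ".join(reversed(words))

# (3) Output specification: fill this structure with the response
# to the decoded task.
output = {{"task": "", "steps": []}}

my_stack = {stack!r}
output["task"] = decode(my_stack)
'''

def build_prompt(query: str) -> str:
    return TEMPLATE.format(stack=encode_as_stack(query))

if __name__ == "__main__":
    print(build_prompt("write a friendly greeting"))
```

Usage is a single call to `build_prompt(query)`, which returns the full prompt string; the decoding and output-filling steps are left for the target model to carry out.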
If you find our paper and tool interesting and useful, please feel free to give us a star and cite us via:
```bibtex
@inproceedings{ren2024codeattack,
  title={Exploring Safety Generalization Challenges of Large Language Models via Code},
  author={Qibing Ren and Chang Gao and Jing Shao and Junchi Yan and Xin Tan and Wai Lam and Lizhuang Ma},
  booktitle={The 62nd Annual Meeting of the Association for Computational Linguistics},
  year={2024},
  url={https://arxiv.org/abs/2403.07865}
}
```

