
LEXam: Benchmarking Legal Reasoning on 340 Law Exams

A diverse, rigorous evaluation suite for legal AI from Swiss, EU, and international law examinations.

This repository provides code for evaluating LLMs on LEXam, which comprises two subsets: open questions and multiple-choice questions.


🚀🔄 Reproducing Paper Results or Evaluating Your Own LLM

Environment Preparation

git clone https://github.com/LEXam-Benchmark/LEXam
cd LEXam
conda create -n lexam python=3.11
conda activate lexam
cd lighteval
pip install -e .[dev]
cd ..
pip install -r requirements.txt

# Set API keys for inference and evaluation.
# The OpenAI key is mandatory for our expert-verified grader, which is based on GPT-4o.
export OPENAI_API_KEY="xxx"
export TOGETHER_API_KEY="xxx"
export DEEPSEEK_API_KEY="xxx"
export ANTHROPIC_API_KEY="xxx"
export GEMINI_API_KEY="xxx"
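
Since the GPT-4o grader requires a working OpenAI key, it can help to sanity-check the key before launching a long run. A minimal sketch, assuming the openai Python package (v1+) is installed; any lightweight call works:

# check_openai_key.py -- optional sanity check for the grader key (not part of the repo)
import os
from openai import OpenAI

assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
client = OpenAI()                       # reads OPENAI_API_KEY from the environment
print(client.models.list().data[0].id)  # any successful call confirms the key is valid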

Evaluating Non-Reasoning LLMs with Hugging Face lighteval

Hugging Face lighteval has the advantage of uniformly evaluating LLMs across different endpoints: local vLLM, OpenAI, Anthropic, Together AI, Gemini, and more.

Together AI, OpenAI, Gemini, and other API-based LLMs can be evaluated as follows:

MODEL="openai/gpt-4o-mini-2024-07-18" 

# Evaluating GPT-4o-mini on LEXam Open Question subset.
python -m lighteval endpoint litellm "${MODEL}" "community|lexamoq_open_question|0|0" --custom-tasks lighteval/community_tasks/lexam_oq_evals.py --output-dir outputs_oq --save-details --use-chat-template

# Evaluating GPT-4o-mini on LEXam Multiple-Choice Question subset.
python -m lighteval endpoint litellm "${MODEL}" "community|lexammcq_mcq_4_choices|0|0" --custom-tasks lighteval/community_tasks/lexam_mcq_evals.py --output-dir outputs_mcq --save-details --use-chat-template
  • MODEL: the target LLM you are evaluating, e.g., openai/gpt-4.1, together_ai/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
  • --output-dir: directory where evaluation results are saved.
  • --save-details: per-sample details (prompts, LLM responses, LLM-judge outputs, and other evaluation metrics) are saved under the details folder of the output directory.
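
To inspect what --save-details produced, the per-sample records can be loaded with pandas. A minimal sketch; the parquet layout under the details folder is an assumption and may differ between lighteval versions:

# inspect_details.py -- peek at per-sample details written by --save-details (illustrative)
import glob
import pandas as pd

# Assumed layout: <output-dir>/details/<model>/<timestamp>/*.parquet
files = glob.glob("outputs_oq/details/**/*.parquet", recursive=True)
df = pd.concat(pd.read_parquet(f) for f in files)
print(len(df), "samples")
print(df.columns.tolist())  # prompts, model responses, judge outputs, metrics, ...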

Local inference using vLLM:

MODEL="meta-llama/Llama-3.1-8B-Instruct" 
export HF_HOME="xxx"
export HUGGINGFACE_TOKEN="xxx"
huggingface-cli login --token $HUGGINGFACE_TOKEN

# Evaluating Llama-3.1-8B-Instruct on LEXam Open Question subset.
python -m lighteval vllm "pretrained=${MODEL},trust_remote_code=True,dtype=bfloat16" "community|lexamoq_open_question|0|0" --custom-tasks lighteval/community_tasks/lexam_oq_evals.py --output-dir outputs_oq --save-details --use-chat-template

# Evaluating Llama-3.1-8B-Instruct on LEXam Multiple-Choice Question subset.
python -m lighteval vllm "pretrained=${MODEL},trust_remote_code=True,dtype=bfloat16" "community|lexammcq_mcq_4_choices|0|0" --custom-tasks lighteval/community_tasks/lexam_mcq_evals.py --output-dir outputs_mcq --save-details --use-chat-template

Evaluating Reasoning LLMs with LiteLLM directly

Reasoning LLMs emit a reasoning scratchpad followed by the final answer. To evaluate only the final answer, we do not use lighteval for reasoning LLMs and instead call litellm_eval.py directly.

MODEL="deepseek-reasoner"
python litellm_eval.py --input_file data/open_questions_test.xlsx --cache_name r1 --llm $MODEL --output_file lexam_oq_${MODEL}.csv --batch_size 2 --task_type open_questions
python litellm_eval.py --input_file data/MCQs_test.xlsx --cache_name r1 --llm $MODEL --output_file lexam_mcq_${MODEL}.csv --batch_size 2 --answer_field gold --task_type mcq_letters
  • MODEL can be set to any model included in MODEL_DICT of litellm_eval.py, e.g., o1, o3-mini, qwq-32b.
  • --output_file: DeepSeek-R1's answers to the open and multiple-choice questions are written to lexam_oq_deepseek-reasoner.csv and lexam_mcq_deepseek-reasoner.csv, respectively.
  • --task_type: choose from ['mcq_letters', 'mcq_numbers', 'open_questions']; mcq_letters and mcq_numbers differ in whether choices are labeled with letters (ABCD) or numbers (1234).
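
For intuition on why the final answer needs to be separated from the scratchpad: some reasoning models return the trace inline, e.g. wrapped in <think> ... </think> tags. The tag format and the stripping step below are illustrative assumptions, not the logic of litellm_eval.py:

# strip_scratchpad.py -- illustration only; assumes an inline <think>...</think> trace
import re

def final_answer(response: str) -> str:
    # Drop everything up to and including the first closing </think> tag, if present.
    return re.sub(r"(?s)^.*?</think>", "", response, count=1).strip()

print(final_answer("<think>IRAC: issue, rule, application...</think>\nThe correct option is B."))
# -> "The correct option is B."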

Then evaluate the answers using our expert-verified LLM judge. This script prints the mean and bootstrapped variance of open-question performance.

MODEL="deepseek-reasoner"
python customized_judge_async.py --input_file lexam_oq_${MODEL}.csv --output_file lexam_oq_${MODEL}_graded.csv --async_call --cache_name gpt4o --llm gpt-4o
  • --input_file: the file with DeepSeek-R1's answers to the open questions; grading results are written to lexam_oq_deepseek-reasoner_graded.csv.
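
For reference, the reported mean and bootstrapped variance can be recomputed from the graded file roughly as follows. A minimal sketch; the grade column name is an assumption about the graded CSV's schema:

# bootstrap_stats.py -- sketch of mean / bootstrapped variance over judge grades (illustrative)
import numpy as np
import pandas as pd

df = pd.read_csv("lexam_oq_deepseek-reasoner_graded.csv")
grades = df["grade"].to_numpy(dtype=float)   # column name is an assumption

rng = np.random.default_rng(0)
boot_means = [rng.choice(grades, size=len(grades), replace=True).mean()
              for _ in range(1000)]
print("mean:", grades.mean(), "bootstrapped variance:", np.var(boot_means))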

Finally, evaluate MCQ accuracy. This script prints accuracy and bootstrapped variance; no LLM calls are involved.

MODEL="deepseek-reasoner"
INPUT_FILE="lexam_mcq_${MODEL}.csv"
python evaluation.py --input_file $INPUT_FILE --response_field ${MODEL}_answer --task_type mcq_letters
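
evaluation.py already reports these numbers; for intuition, the core accuracy check amounts to comparing the predicted letter with the gold label. A minimal sketch, assuming the column names match the flags above (gold and deepseek-reasoner_answer):

# mcq_accuracy.py -- sketch of the MCQ accuracy computation (column names are assumptions)
import pandas as pd

df = pd.read_csv("lexam_mcq_deepseek-reasoner.csv")
pred = df["deepseek-reasoner_answer"].astype(str).str.strip().str.upper().str[0]
gold = df["gold"].astype(str).str.strip().str.upper().str[0]
print("accuracy:", (pred == gold).mean())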

Licenses

The code in this repository is released under the Apache-2.0 license (LICENSE); the LEXam data is released under CC-BY-4.0 (LICENSE_DATA).

Citation

If you find the dataset helpful, please consider citing LEXam:

@article{fan2025lexam,
  title={LEXam: Benchmarking Legal Reasoning on 340 Law Exams},
  author={Fan, Yu and Ni, Jingwei and Merane, Jakob and Tian, Yang and Hermstr{\"u}wer, Yoan and Huang, Yinya and Akhtar, Mubashara and Salimbeni, Etienne and Geering, Florian and Dreyer, Oliver and Brunner, Daniel and Leippold, Markus and Sachan, Mrinmaya and Stremitzer, Alexander and Engel, Christoph and Ash, Elliott and Niklaus, Joel},
  journal={arXiv preprint arXiv:2505.12864},
  year={2025}
}
