ContraLegal AI is an autonomous legal intelligence platform engineered to transform unstructured contract data into actionable risk distributions. By synthesizing Legal-BERT transformer architectures with Retrieval-Augmented Generation (RAG), the system provides granular multi-class risk scoring, automated redrafting, and spatial PDF highlights to eliminate manual bottlenecks in enterprise legal review.
| Capability | Orchestration | Technical Specification |
|---|---|---|
| Trimodal Classification | Legal-BERT | High, Medium, and Low-intensity risk granularity |
| Dynamic Scoring | Hybrid Logic | Fusion of Transformer probabilities and deterministic keyword heuristics |
| Explainable AI | RAG + LangChain | Root-cause analysis of flagged clauses in professional nomenclature |
| Strategic Redrafting | Generative AI | Automated generation of balanced, legally-sound alternative phrasing |
| Conversational Querying | FAISS Vector Store | Real-time, document-grounded Q&A for complex legal inquiries |
| Spatial Annotation | PyMuPDF API | Physical coordinate-to-text mapping for in-situ PDF highlighting |
| Relational Data Export | openpyxl | Structured synthesis of risk distributions in Excel and CSV formats |
| Thematic Clustering | Scikit-Learn | Unsupervised K-Means grouping of obligation-specific clauses |
The platform is built upon a high-concurrency architecture with a strict separation of concerns across research and deployment layers.
- Fine-tuned the
nlpaueb/legal-bert-base-uncasedtransformer using a weighted-trainer objective for imbalanced class distribution. - Engineered the 3-class quantitative heuristic for synthetic label generation spanning over 21,000 samples.
- Developed the formal ablation study and multi-class ROC-AUC evaluation suite to validate transformer superiority over statistical baselines.
- Architected the retrieval-augmented generation pipeline utilizing FAISS for vectorized similarity search.
- Engineered the LLM Provider Factory, enabling seamless interoperability between Google Gemini, Groq, and OpenAI.
- Validated prompt-engineering strategies for deterministic clause synthesis and document-grounded conversational flows.
- Engineered the spatial highlighting engine using PyMuPDF to perform physical document marking via bounding-box coordinate tracking.
- Implemented semantic document segmentation to optimize transformer context windows.
- Architected the automated CI/CD infrastructure via GitHub Actions for continuous environment validation.
The integration of transformer architectures resulted in a fundamental shift in both classification precision and recall intensity.
| Metrical Indicator | Random Forest Baseline | Legal-BERT Transformer | Improvement (Δ) |
|---|---|---|---|
| Accuracy | 94.44% | 97.01% | +2.57% |
| Weighted F1 | 0.9441 | 0.9702 | +2.76% |
| Macro F1 | 0.8901 | 0.9371 | +5.28% |
| ROC-AUC (Macro) | 0.9870 | 0.9948 | +0.79% |
| High Risk Recall | 73.96% | 85.94% | +11.98% |
Note: The 11.98% surge in High Risk Recall represents the most critical engineering milestone, ensuring safety in mission-critical legal review.
ContraLegal-AI/
├── app.py # Streamlit Production Environment
├── .github/workflows/ # Automated CI/CD (python-app.yml)
├── src/
│ ├── model_trainer.py # Phase-integrated Training Orchestrator
│ ├── data_pipeline/ # Semantic Extraction & Normalization
│ ├── inference/
│ │ ├── predictor.py # Trimodal Detection Engine (BERT/RF)
│ │ ├── llm_engine.py # RAG Orchestrator & Conversational Layer
│ │ └── keyword_engine.py # Deterministic Rule Definitions
│ ├── model/
│ │ ├── bert_trainer.py # Transformer Fine-tuning Suite
│ │ └── evaluator.py # Quantitative Performance Metrics
│ └── utils/
│ └── pdf_annotator.py # Spatial Coordinate Highlighting
├── models/
│ ├── legal_bert/ # Fine-tuned Weights (nlpaueb)
│ └── ablation_study.png # Baseline vs. Transformer Visualization
├── notebooks/
│ └── train_legal_bert_colab.py # GPU-accelerated Training Script
└── report/
├── report.pdf # Formally Published IEEE Paper
└── report.tex # Scientific Manuscript Source
git clone https://github.com/AyushCoder9/ContraLegal-AI.git
pip install -r requirements.txtTo initiate the production dashboard with the global pre-trained model:
streamlit run app.pyTo execute the full analytical pipeline and regenerate performance artifacts:
python -m src.model_trainerThe technical methodology, algorithmic decisions, and empirical evaluations are documented in the associated IEEE conference-format manuscript located in the report/ directory.
Engineered for Legal Precision.