71 commits
cade4d5
Add LLM benchmarking framework to staging
kubraaksux Jan 19, 2026
8e7d6da
Add LLM inference support to JMLC API via Py4J bridge
kubraaksux Feb 12, 2026
47dd0db
Refactor loadModel to accept worker script path as parameter
kubraaksux Feb 13, 2026
672a3fa
Add dynamic port allocation and improve resource cleanup
kubraaksux Feb 13, 2026
dacdc1c
Move llm_worker.py to fix Python module collision
kubraaksux Feb 13, 2026
29f657c
Use python3 with fallback to python in Connection.java
kubraaksux Feb 14, 2026
e40e4f2
Add batch inference with FrameBlock and metrics support
kubraaksux Feb 14, 2026
fdd1684
Clean up test: extract constants and shared setup method
kubraaksux Feb 14, 2026
b9ba3e0
Add token counts, GPU support, and improve error handling
kubraaksux Feb 14, 2026
5588dcc
Fix bugs and improve code quality in benchmark framework
kubraaksux Feb 14, 2026
1510e8a
Fix fake metrics and add compute cost tracking
kubraaksux Feb 15, 2026
4a06093
Add tests, ROUGE scoring, concurrent benchmarking, GPU profiling
kubraaksux Feb 15, 2026
a18979f
Fix bash 3.x compatibility in run_all_benchmarks.sh
kubraaksux Feb 15, 2026
3d4f9e8
Add embeddings workload (STS-B semantic similarity)
kubraaksux Feb 15, 2026
deb21ad
Add compute cost model and fix ROUGE/cost aggregation
kubraaksux Feb 15, 2026
7239460
Add benchmark results for project submission
kubraaksux Feb 15, 2026
1317509
Add vLLM benchmark results (Mistral 7B + Qwen 3B on H100)
kubraaksux Feb 15, 2026
fd3a117
Add LLM inference support to JMLC API via Py4J bridge
kubraaksux Feb 12, 2026
af42052
Refactor loadModel to accept worker script path as parameter
kubraaksux Feb 13, 2026
c4e57f9
Add dynamic port allocation and improve resource cleanup
kubraaksux Feb 13, 2026
0cc05f6
Move llm_worker.py to fix Python module collision
kubraaksux Feb 13, 2026
036a221
Use python3 with fallback to python in Connection.java
kubraaksux Feb 14, 2026
ef8c1f4
Add batch inference with FrameBlock and metrics support
kubraaksux Feb 14, 2026
af54019
Clean up test: extract constants and shared setup method
kubraaksux Feb 14, 2026
581669f
Add token counts, GPU support, and improve error handling
kubraaksux Feb 14, 2026
98b81bb
Add SystemDS JMLC backend with FrameBlock batch processing
kubraaksux Feb 15, 2026
190d952
Add embeddings workload for SystemDS backend
kubraaksux Feb 15, 2026
a39078c
Trim verbose docstring in systemds_backend.py
kubraaksux Feb 15, 2026
b19dff1
Replace SystemDS distilgpt2 with Qwen 3B for direct vLLM comparison
kubraaksux Feb 15, 2026
a94aee1
Run SystemDS with Qwen 3B and Mistral 7B for direct vLLM comparison
kubraaksux Feb 16, 2026
52d5269
Remove deprecated trust_remote_code from dataset loaders
kubraaksux Feb 16, 2026
d72711e
Update README with actual benchmark results and SystemDS backend docs
kubraaksux Feb 16, 2026
d4be2a1
Add gitignore rules for .env files, meeting notes, and local tool con…
kubraaksux Feb 16, 2026
1b9a6e3
Redesign benchmark report for clarity and minimal UI
kubraaksux Feb 16, 2026
27826ac
Update benchmark runner with systemds backend and GPU comparison mode
kubraaksux Feb 16, 2026
bd63237
Clean up report: remove dead code, unused CSS, and hardcoded model name
kubraaksux Feb 16, 2026
dbf6875
Add presentation-friendly summary tables to benchmark report
kubraaksux Feb 16, 2026
2e984a2
Increase worker startup timeout to 300s for larger models
kubraaksux Feb 16, 2026
bf666c2
Revert accidental changes to MatrixBlockDictionary.java
kubraaksux Feb 16, 2026
a8a1b79
Add concurrency=4 benchmark results and fix json_extraction type check
kubraaksux Feb 16, 2026
85bfa93
Revert accidental changes to MatrixBlockDictionary.java
kubraaksux Feb 16, 2026
7e48a8b
Regenerate benchmark report with SystemDS results
kubraaksux Feb 16, 2026
7e250a4
Add GPU batching to SystemDS JMLC backend with benchmark results
kubraaksux Feb 16, 2026
5faa691
Add GPU batching support to JMLC LLM inference
kubraaksux Feb 16, 2026
4e8e684
Keep both sequential and batched inference modes for reproducibility
kubraaksux Feb 16, 2026
c9c85d4
Keep both sequential and batched inference modes in PreparedScript
kubraaksux Feb 16, 2026
4b44dd1
Add gitignore rules for .env files, meeting notes, and local tool config
kubraaksux Feb 16, 2026
72bc334
Add llmPredict builtin, opcode and ParamBuiltinOp entries
kubraaksux Feb 16, 2026
0ad1b56
Add llmPredict parser validation in ParameterizedBuiltinFunctionExpre…
kubraaksux Feb 16, 2026
1e48362
Wire llmPredict through hop, lop and instruction generation
kubraaksux Feb 16, 2026
de675ac
Add llmPredict CP instruction with HTTP-based inference
kubraaksux Feb 16, 2026
5eab87d
Remove Py4J-based LLM inference from JMLC API
kubraaksux Feb 16, 2026
bea062a
Rewrite LLM test to use llmPredict DML built-in
kubraaksux Feb 16, 2026
edf4e39
Add OpenAI-compatible HTTP inference server for HuggingFace models
kubraaksux Feb 16, 2026
04f82ac
Merge branch 'llm-api' into llm-benchmark
kubraaksux Feb 16, 2026
f5fa4ec
Update benchmark backend to use llmPredict DML built-in
kubraaksux Feb 16, 2026
d92eb7c
Fix llmPredict code quality and clean up Py4J remnants
kubraaksux Feb 16, 2026
45882e2
Fix llmPredict code quality issues
kubraaksux Feb 16, 2026
c3e9a1f
Add concurrency parameter to llmPredict built-in
kubraaksux Feb 16, 2026
c0ec34b
Merge branch 'llm-api' into llm-benchmark
kubraaksux Feb 16, 2026
6d8797c
Remove old SystemDS results and clean up headers
kubraaksux Feb 16, 2026
223c606
Pass concurrency to llmPredict via SYSTEMDS_CONCURRENCY env var
kubraaksux Feb 16, 2026
d269db7
Route SystemDS concurrency through Java instead of Python threads
kubraaksux Feb 16, 2026
a27e0fa
Fix JVM incubator vector module for Py4J gateway
kubraaksux Feb 16, 2026
a710dca
Fix JMLC frame binding: match DML variable names to registered inputs
kubraaksux Feb 16, 2026
5d47925
Add SystemDS llmPredict benchmark results (c=1 and c=4)
kubraaksux Feb 16, 2026
4629465
Fix benchmark results accuracy and update documentation
kubraaksux Feb 16, 2026
d1adf16
Fix data accuracy across README, PR description, and HTML report
kubraaksux Feb 16, 2026
ac5a69e
Rewrite README and PR description with accurate data and honest concl…
kubraaksux Feb 16, 2026
83b90e4
Fix math extraction bug, add cost tables, cross-backend comparisons, …
kubraaksux Feb 16, 2026
e898879
Fix Mistral math explanation: 20/31 wrong math, 10/31 extractor failures
kubraaksux Feb 16, 2026
1 change: 0 additions & 1 deletion .gitignore
@@ -156,4 +156,3 @@ docker/mountFolder/*.bin
docker/mountFolder/*.bin.mtd

SEAL-*/

34 changes: 34 additions & 0 deletions scripts/staging/llm-bench/.gitignore
@@ -0,0 +1,34 @@
# Benchmark outputs (committed for project submission)
# results/

# Python
__pycache__/
*.pyc
*.pyo
*.egg-info/
.eggs/

# Virtual environment
.venv/
venv/
env/

# IDE
.idea/
.vscode/
*.swp
*.swo

# Environment variables
.env

# OS
.DS_Store
Thumbs.db

# Reports (committed for project submission)
# *.html
!templates/*.html

# Dataset cache
.cache/
150 changes: 150 additions & 0 deletions scripts/staging/llm-bench/PR_DESCRIPTION.md
@@ -0,0 +1,150 @@
Benchmarking framework that compares LLM inference across four backends: OpenAI API, Ollama, vLLM, and SystemDS JMLC with the native `llmPredict` built-in. Evaluated on 5 workloads (math, reasoning, summarization, JSON extraction, embeddings) with n=50 per workload.

## Purpose and motivation

This project was developed as part of the LDE (Large-Scale Data Engineering) course. The `llmPredict` native built-in was added to SystemDS in PR #2430. This PR (#2431) contains the **benchmarking framework** that evaluates `llmPredict` against established LLM serving solutions, plus the benchmark results.

Research questions:

1. How does SystemDS's `llmPredict` built-in compare to dedicated LLM backends (OpenAI, Ollama, vLLM) in terms of accuracy and throughput?
2. How does Java-side concurrent request dispatch scale with the `llmPredict` instruction?
3. What is the cost-performance tradeoff across cloud APIs, local CPU inference, and GPU-accelerated backends?

Approach:

- Built a Python benchmarking framework that runs standardized workloads against all four backends under identical conditions (same prompts, same evaluation metrics)
- The `llmPredict` built-in (from PR #2430) goes through the full DML compilation pipeline (parser → hops → lops → CP instruction) and makes HTTP calls to any OpenAI-compatible inference server
- Ran the evaluation in two phases: (1) a sequential baseline across all backends, (2) SystemDS with Java-side concurrency (an `ExecutorService` thread pool inside the `llmPredict` instruction)
- GPU backends (vLLM, SystemDS) executed on an NVIDIA H100 PCIe (81GB). Ollama ran on a local MacBook (CPU); the OpenAI backend ran on the same MacBook, calling the cloud API. All runs used 50 samples per workload with temperature=0.0 for reproducibility.
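
To keep conditions identical, every backend exposes the same generation interface and the runner loops over shared workload samples. A minimal sketch of that loop (class and method names here are illustrative assumptions, not the actual `runner.py` API):

```python
# Illustrative sketch only: the Backend/Workload attributes used here are
# assumptions for this example, not the actual runner.py API.
import time

def run_workload(backend, workload, n=50):
    """Run one workload against one backend with identical prompts and settings."""
    results = []
    for sample in workload.samples[:n]:
        start = time.time()
        # Every backend receives the same prompt and temperature=0.0.
        prediction = backend.generate(sample["prompt"], temperature=0.0)
        latency_ms = (time.time() - start) * 1000
        results.append({
            "correct": workload.accuracy_check(prediction, sample["reference"]),
            "latency_ms": latency_ms,
        })
    accuracy = sum(r["correct"] for r in results) / len(results)
    return {"accuracy": accuracy, "samples": results}
```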

## Project structure

```
scripts/staging/llm-bench/
├── runner.py # Main benchmark runner (CLI entry point)
├── backends/
│ ├── openai_backend.py # OpenAI API (gpt-4.1-mini)
│ ├── ollama_backend.py # Ollama local server (llama3.2)
│ ├── vllm_backend.py # vLLM serving engine (streaming HTTP)
│ └── systemds_backend.py # SystemDS JMLC via Py4J + llmPredict DML
├── workloads/
│ ├── math/ # GSM8K dataset, numerical accuracy
│ ├── reasoning/ # BoolQ dataset, logical accuracy
│ ├── summarization/ # XSum dataset, ROUGE-1 scoring
│ ├── json_extraction/ # Built-in structured extraction
│ └── embeddings/ # STS-Benchmark, similarity scoring
├── evaluation/
│ └── perf.py # Latency, throughput metrics
├── scripts/
│ ├── report.py # HTML report generator
│ ├── aggregate.py # Cross-run aggregation
│ └── run_all_benchmarks.sh # Batch automation (all backends, all workloads)
├── results/ # Benchmark outputs (metrics.json per run)
└── tests/ # Unit tests for accuracy checks + runner
```

Note: The `llmPredict` built-in implementation (Java pipeline files) is in PR #2430. This PR includes the benchmark framework and results only. Some `llmPredict` code appears in this diff because both branches share the same local repository.

## Backends

| Backend | Type | Model | Hardware | Inference path |
|---------|------|-------|----------|----------------|
| OpenAI | Cloud API | gpt-4.1-mini | MacBook (API call) | Python HTTP to OpenAI servers |
| Ollama | Local server | llama3.2 (3B) | MacBook CPU | Python HTTP to local Ollama |
| vLLM | GPU server | Qwen2.5-3B-Instruct | NVIDIA H100 | Python streaming HTTP to vLLM engine |
| vLLM | GPU server | Mistral-7B-Instruct | NVIDIA H100 | Python streaming HTTP to vLLM engine |
| SystemDS | JMLC API | Qwen2.5-3B-Instruct | NVIDIA H100 | Py4J → JMLC → DML `llmPredict` → Java HTTP → vLLM |

SystemDS and vLLM Qwen 3B use the **same model on the same vLLM inference server**, making their accuracy directly comparable. Any accuracy difference comes from the serving path, not the model.
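
Both serving paths ultimately issue the same kind of request to that shared OpenAI-compatible endpoint, which is why their outputs are directly comparable. A rough sketch of such a request (the URL, port, and payload shown here are assumptions for illustration, not code from this PR):

```python
# Sketch of an OpenAI-compatible chat completion request as served by vLLM.
# Endpoint URL, port, and model identifier are assumptions for illustration.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # vLLM's OpenAI-compatible server
    json={
        "model": "Qwen/Qwen2.5-3B-Instruct",
        "messages": [{"role": "user", "content": "Summarize this article in one sentence: ..."}],
        "temperature": 0.0,  # deterministic decoding, shared across all runs
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```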

## Benchmark results

### Evaluation methodology

Each workload defines its own `accuracy_check(prediction, reference)` function that returns true/false per sample; the reported accuracy is `correct_count / n`. All accuracy counts were verified against the raw `samples.jsonl` files and reproduced locally. A simplified sketch of the math check follows the table.

| Workload | Criterion | How it works |
|----------|-----------|-------------|
| math | Exact numerical match | Extracts the final number from the model's chain-of-thought response using regex patterns (explicit markers like `####`, `\boxed{}`, bold `**N**`, or the last number in the text). Compares against the GSM8K reference answer. Passes if `abs(predicted - reference) < 1e-6`. |
| reasoning | Extracted answer match | Extracts yes/no or text answer from the response using CoT markers ("answer is X", "therefore X") or the last short line. Compares against BoolQ reference using exact match, word-boundary substring match, or numeric comparison. |
| summarization | ROUGE-1 F1 >= 0.2 | Computes ROUGE-1 F1 score between the generated summary and the XSum reference using the `rouge-score` library with stemming. A threshold of 0.2 means the summary shares at least 20% unigram overlap (F1) with the reference. Predictions shorter than 10 characters are rejected. |
| json_extraction | >= 90% fields match | Parses JSON from the model response (tries direct parse, markdown code fences, regex). Checks that all required fields from the reference are present. Values compared with strict matching: case-insensitive for strings, exact for numbers/booleans. Passes if at least 90% of field values match. |
| embeddings | Score within 1.0 of reference | The model rates sentence-pair similarity on the 0-5 STS scale. The predicted score is extracted from the response. Passes if `abs(predicted - reference) <= 1.0`, i.e. within 20% of the 0-5 rating scale. |
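
As referenced above, a simplified sketch of the math check (the actual workload uses a larger set of extraction patterns; this is an illustration, not the exact code):

```python
# Simplified sketch of the math accuracy check described in the table above;
# the real workload applies a larger set of extraction patterns.
import re

def extract_final_number(text: str):
    """Pull the final numeric answer out of a chain-of-thought response."""
    for pattern in (r"####\s*(-?[\d,]+\.?\d*)",         # GSM8K-style marker
                    r"\\boxed\{(-?[\d,]+\.?\d*)\}",      # \boxed{...}
                    r"\*\*(-?[\d,]+\.?\d*)\*\*"):        # bold **N**
        m = re.search(pattern, text)
        if m:
            return float(m.group(1).replace(",", ""))
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)      # fallback: last number in the text
    return float(numbers[-1].replace(",", "")) if numbers else None

def accuracy_check(prediction: str, reference: str) -> bool:
    predicted = extract_final_number(prediction)
    return predicted is not None and abs(predicted - float(reference)) < 1e-6
```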

### Accuracy (% correct, n=50 per workload)

| Workload | Ollama llama3.2 3B | OpenAI gpt-4.1-mini | vLLM Qwen 3B | SystemDS Qwen 3B c=1 | SystemDS Qwen 3B c=4 | vLLM Mistral 7B |
|---|---|---|---|---|---|---|
| math | 58% | 94% | 68% | 68% | 68% | 38% |
| json_extraction | 74% | 84% | 52% | 52% | 52% | 50% |
| reasoning | 44% | 70% | 60% | 60% | 64% | 68% |
| summarization | 80% | 88% | 50% | 50% | 62% | 68% |
| embeddings | 40% | 88% | 90% | 90% | 90% | 82% |

### Key comparisons

**SystemDS vs vLLM (same model, same server — Qwen2.5-3B-Instruct on H100)**:
SystemDS c=1 matches vLLM Qwen 3B accuracy exactly on all 5 workloads (68%, 52%, 60%, 50%, 90%). This confirms that the `llmPredict` instruction produces identical results to calling vLLM directly: both use temperature=0.0 (deterministic decoding), the same prompts, and the same inference server. At c=4, reasoning (64% vs 60%) and summarization (62% vs 50%) diverge because concurrent requests cause vLLM to batch them differently, which introduces floating-point non-determinism in the GPU computation.

**OpenAI gpt-4.1-mini vs local models**:
OpenAI achieves the highest accuracy on four of the five workloads. The gap is largest on math (94% vs 68% for Qwen 3B); on embeddings, the local model actually edges it out (90% vs 88%). OpenAI's advantage comes from model quality (a much larger model), not serving infrastructure.

**Qwen 3B vs Mistral 7B (different models, same vLLM server)**:
Despite being smaller (3B vs 7B parameters), Qwen outperforms Mistral on math (68% vs 38%) and embeddings (90% vs 82%). Mistral is better on reasoning (68% vs 60%) and summarization (68% vs 50%). This shows that model architecture and training data matter more than parameter count alone. Mistral's low math score (38%) has two main causes: in 20 of the 31 incorrect samples the model computed the wrong answer outright (wrong formulas, negative results, or refusing to solve), and in 10 the correct answer appeared in the response but the number extractor grabbed an intermediate value instead, due to verbose chain-of-thought formatting.

**Ollama llama3.2 3B (MacBook CPU)**:
Ollama leads on summarization (80%) likely because llama3.2's training emphasized concise outputs that align well with the ROUGE-1 threshold. It scores lowest on embeddings (40%) because the model frequently refuses the similarity-rating task or defaults to high scores regardless of actual similarity.

### Per-prompt latency (mean ms/prompt, n=50)

| Workload | Ollama (MacBook CPU) | OpenAI (MacBook → Cloud) | vLLM Qwen 3B (H100) | SystemDS Qwen 3B c=1 (H100) |
|---|---|---|---|---|
| math | 5781 | 3630 | 4619 | 2273 |
| json_extraction | 1642 | 1457 | 1151 | 610 |
| reasoning | 5252 | 2641 | 2557 | 1261 |
| summarization | 1079 | 1036 | 791 | 373 |
| embeddings | 371 | 648 | 75 | 41 |

**Note on measurement methodology**: Latency numbers are **not directly comparable** across backends because each measures differently. The vLLM backend uses Python `requests` with streaming (SSE token-by-token parsing adds overhead). SystemDS measures Java-side `HttpURLConnection` round-trip time (non-streaming, gets full response at once). Ollama measures Python HTTP round-trip on CPU. OpenAI includes network round-trip to cloud servers. The accuracy comparison is the apples-to-apples metric since all backends process the same prompts.
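
For illustration, the two Python-side timing styles differ roughly as follows (a sketch, not the actual backend code; the SystemDS path is timed on the Java side inside the instruction and is not shown):

```python
# Sketch of the two Python-side timing styles (not the actual backend code).
import time
import requests

def timed_streaming(url, payload):
    """vLLM backend style: stream SSE chunks and assemble tokens as they arrive."""
    start = time.time()
    chunks = []
    with requests.post(url, json=payload, stream=True, timeout=120) as resp:
        for line in resp.iter_lines():           # per-chunk parsing adds overhead
            if line and line != b"data: [DONE]":
                chunks.append(line)              # token-level handling would go here
    return chunks, (time.time() - start) * 1000  # ms, includes streaming overhead

def timed_blocking(url, payload):
    """Ollama/OpenAI style: one blocking round trip, full body parsed at once."""
    start = time.time()
    resp = requests.post(url, json=payload, timeout=120)
    return resp.json(), (time.time() - start) * 1000  # ms, plain round-trip time
```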

### SystemDS concurrency scaling (throughput)

| Workload | c=1 (req/s) | c=4 (req/s) | Speedup |
|---|---|---|---|
| math | 0.44 | 1.63 | 3.71x |
| json_extraction | 1.62 | 5.65 | 3.49x |
| reasoning | 0.79 | 3.11 | 3.95x |
| summarization | 2.62 | 7.27 | 2.78x |
| embeddings | 20.07 | 46.34 | 2.31x |

Throughput = `n / total_wall_clock_seconds` (measured Python-side, end-to-end including JMLC overhead). Theoretical maximum speedup is 4x. Math and reasoning (longer generation, ~1-2s per prompt) get closest to 4x because the per-request time dominates. Embeddings (very short responses, ~41ms per prompt) only achieves 2.31x because JMLC pipeline overhead becomes proportionally significant.
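
The speedup column can be recomputed directly from the throughput figures above:

```python
# Reproduce the speedup column from the measured throughput (req/s) above.
throughput = {                       # workload: (c=1, c=4)
    "math":            (0.44, 1.63),
    "json_extraction": (1.62, 5.65),
    "reasoning":       (0.79, 3.11),
    "summarization":   (2.62, 7.27),
    "embeddings":      (20.07, 46.34),
}
for workload, (c1, c4) in throughput.items():
    print(f"{workload:16s} speedup = {c4 / c1:.2f}x")   # theoretical ceiling: 4x
```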

### Cost comparison

All backends incur compute cost (hardware amortization + electricity) for the machine running them. GPU backends run on the H100 server; Ollama and OpenAI run on a local MacBook. OpenAI additionally incurs API cost per token.

**How cost is calculated**: `compute_cost = wall_clock_seconds × hourly_rate / 3600`, where `hourly_rate = hardware_cost / lifetime_hours + power_kW × electricity_price_per_kWh`. Assumptions: H100 server: 350 W, $30K amortized over 15K hours ($2.00/h hardware + $0.105/h electricity = $2.105/h). MacBook: 50 W, $3K over 15K hours ($0.20/h + $0.015/h = $0.215/h). OpenAI API cost is recorded by the runner from the token usage reported with each API response.
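
A worked check of the cost model under the stated assumptions (the electricity price of $0.30/kWh is implied by the $0.105/h figure for a 350 W server):

```python
# Worked check of the compute cost model using the assumptions stated above.
def compute_cost(wall_clock_s, hardware_cost, lifetime_hours, power_watts, usd_per_kwh=0.30):
    hourly_rate = hardware_cost / lifetime_hours + (power_watts / 1000) * usd_per_kwh
    return wall_clock_s * hourly_rate / 3600

# H100 server: $30K over 15K hours + 350 W  -> $2.105/hour
print(round(compute_cost(230, 30_000, 15_000, 350), 4))  # SystemDS c=1: 0.1345
print(round(compute_cost(460, 30_000, 15_000, 350), 4))  # vLLM Qwen 3B: ~0.269 (table: 0.2688)
# MacBook: $3K over 15K hours + 50 W        -> $0.215/hour
print(round(compute_cost(706,  3_000, 15_000,  50), 4))  # Ollama: 0.0422
```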

| Backend | Hardware | Wall clock (250 queries) | Compute cost | API cost | Total cost | Cost per query |
|---|---|---|---|---|---|---|
| Ollama llama3.2 3B | MacBook CPU | 706s | $0.0422 | -- | $0.0422 | $0.000169 |
| OpenAI gpt-4.1-mini | MacBook + API | 471s | $0.0281 | $0.0573 | $0.0855 | $0.000342 |
| vLLM Qwen 3B | H100 GPU | 460s | $0.2688 | -- | $0.2688 | $0.001076 |
| SystemDS Qwen 3B c=1 | H100 GPU | 230s | $0.1345 | -- | $0.1345 | $0.000538 |
| SystemDS Qwen 3B c=4 | H100 GPU | 64s | $0.0372 | -- | $0.0372 | $0.000149 |

OpenAI API cost breakdown (recorded per run): math $0.0227, reasoning $0.0172, json_extraction $0.0080, summarization $0.0076, embeddings $0.0019.

## Conclusions

1. **SystemDS `llmPredict` produces identical results to vLLM**: SystemDS c=1 matches vLLM Qwen 3B accuracy exactly on all 5 workloads (68%, 52%, 60%, 50%, 90%). Both use the same model on the same inference server with temperature=0.0, confirming that the `llmPredict` DML built-in adds no distortion to model outputs.

2. **Concurrency scales throughput 2.3-3.9x**: The `ExecutorService` thread pool in the `llmPredict` instruction dispatches up to 4 requests concurrently. Longer-running workloads (math 3.71x, reasoning 3.95x) get closest to the theoretical 4x speedup. Short workloads (embeddings 2.31x) are limited by JMLC pipeline overhead.

3. **OpenAI leads on accuracy but costs more per query**: gpt-4.1-mini achieves the highest accuracy on four of the five workloads (94% math, 84% json, 70% reasoning, 88% summarization) but at $0.000342/query; on embeddings it trails Qwen 3B slightly (88% vs 90%). SystemDS c=4 costs $0.000149/query, 56% less, with competitive accuracy on focused tasks such as embeddings.

4. **Model quality matters more than parameter count**: Qwen 3B outperforms the larger Mistral 7B on math (68% vs 38%) and embeddings (90% vs 82%), while Mistral 7B is stronger on reasoning (68% vs 60%) and summarization (68% vs 50%). With the same model, the serving framework (vLLM vs SystemDS) has no impact on accuracy at c=1; the small c=4 differences stem from GPU batching non-determinism, not from the framework itself.

5. **Concurrency reduces compute cost on GPU**: SystemDS c=4 at $0.000149/query is the cheapest configuration overall, 86% less than vLLM's $0.001076/query, because higher throughput means less wall-clock time per query. Ollama on the MacBook CPU comes close ($0.000169/query) thanks to low hardware and power costs, but it is roughly 11x slower in wall-clock time.

6. **Latency measurements are not comparable across backends**: Each backend uses a different HTTP client (Python streaming, Java non-streaming, cloud API) and measures time differently. Per-prompt latency should only be compared within the same backend across workloads, not across backends.