Expert evaluation server for AI-generated bumblebee images. Experts evaluate synthetic images through a two-stage workflow:
- Stage 1 — Blind Identification: Expert identifies the species from the image alone (Family → Genus → Species dropdowns, includes "No match" option)
- Stage 2 — Detailed Evaluation: After revealing the target species and reference images, expert rates morphological fidelity, diagnostic completeness, caste identification, and failure modes
Three target species from the expert validation set (150 images, 50 per species):
- Bombus ashtoni — Ashton Cuckoo Bumble Bee
- Bombus sandersoni — Sanderson's Bumble Bee
- Bombus flavidus — Fernald's Cuckoo Bumble Bee
expert_eval_server/
├── app.py # Flask application (routes, DB models, logic)
├── constants.py # Configuration (mode, species, taxonomy, evaluation options)
├── wsgi.py # Gunicorn entry point
├── requirements.txt # Python dependencies
├── gunicorn_config.py # Gunicorn settings (workers, ports, logging)
├── nginx.conf # Nginx reverse proxy config (optional)
├── deploy_server.sh # Start/stop/restart/switch-mode server
├── reset_server.sh # Hard reset (kill processes, clear sessions)
├── templates/
│ ├── start_evaluation.html # Landing page
│ ├── evaluation_form.html # Two-stage evaluation interface
│ ├── complete_evaluation.html # Completion page
│ └── already_completed.html # Shown to returning users
├── assets/
│ ├── bumblebee_images_metadata.json # Image metadata (auto-generated)
│ └── expert_validation_manifest.json # LLM judge manifest (source of truth)
├── scripts/
│ └── generate_expert_metadata.py # Generate metadata from local images + manifest
├── static/
│ ├── bumblebees/ # 150 synthetic images (50 per species)
│ └── references/ # Reference images for each species
├── instance/ # SQLite databases (auto-created)
├── logs/ # Gunicorn logs (auto-created)
└── flask_session/ # Server-side session files (auto-created)
- Python 3.10+
- Nginx is optional (Gunicorn alone is sufficient for single-user evaluation)
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Images are already bundled in static/:
- static/bumblebees/{species}/: 150 synthetic images (from the expert validation set)
- static/references/{species}/: Reference images for each species
python scripts/generate_expert_metadata.py
This scans static/bumblebees/ and reads assets/expert_validation_manifest.json to produce assets/bumblebee_images_metadata.json. All paths are local, with no external dependencies.
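Before regenerating the metadata, it can help to confirm that each species directory holds the expected 50 images. The following is a minimal sketch, not part of the repository: the function name `count_images_per_species` and the set of image extensions are assumptions, and only the `static/bumblebees/{species}/` layout comes from this README.

```python
from pathlib import Path

# Assumption: images use one of these extensions; adjust to match the actual files.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def count_images_per_species(root):
    """Return {species_dir_name: image_count} for each subdirectory of root."""
    counts = {}
    for species_dir in sorted(Path(root).iterdir()):
        if species_dir.is_dir():
            counts[species_dir.name] = sum(
                1
                for f in species_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


if __name__ == "__main__":
    root = Path("static/bumblebees")
    if root.is_dir():
        for species, n in count_images_per_species(root).items():
            # Flag any directory that deviates from the 50-per-species layout.
            note = "" if n == 50 else "  <-- expected 50"
            print(f"{species}: {n}{note}")
```

Run it from the expert_eval_server/ directory so that the relative path resolves.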
Edit constants.py:
MODE = "calibration" # 10 images, calibration DB (for practice)
MODE = "full" # 150 images, production DB (real evaluation)
Or use the CLI to switch (see below).
bash deploy_server.sh start # Start Gunicorn (skips Nginx if not installed)
bash deploy_server.sh stop # Stop server
bash deploy_server.sh restart # Restart server
bash deploy_server.sh status # Show mode, running status
bash deploy_server.sh switch-mode # Toggle calibration <-> full (clears sessions, restarts)
The app runs on http://localhost:5050 (Gunicorn direct). If Nginx is installed, it also proxies on port 8080.
If the server runs on a remote machine (e.g. MIT cluster), set up an SSH tunnel from your laptop:
ssh -L 5050:localhost:5050 msun14@login007.mit.edu
Then open http://localhost:5050/?PARTICIPANT_ID=your_name in your laptop browser.
# Start in calibration mode (10 images, practice)
bash deploy_server.sh start
# When ready for real evaluation:
bash deploy_server.sh switch-mode # switches to full (150 images)
Each mode has its own database, so switching does not affect the other mode's data. Sessions are cleared automatically on switch (required because the session stores progress).
bash reset_server.sh
Kills all Gunicorn processes and clears logs and Flask sessions. Does not delete databases.
source venv/bin/activate
python app.py
# Runs on http://localhost:5050
http://<host>:<port>/?PARTICIPANT_ID=<id>&STUDY_ID=<study>&SESSION_ID=<session>
| Parameter | Required | Default | Description |
|---|---|---|---|
| PARTICIPANT_ID | Yes | 0 | Unique identifier for the participant |
| STUDY_ID | No | 0 | Study identifier (useful for grouping) |
| SESSION_ID | No | 0 | Session identifier |
Examples:
http://localhost:5050/?PARTICIPANT_ID=expert_1
http://localhost:5050/?PARTICIPANT_ID=expert_1&STUDY_ID=pilot&SESSION_ID=1
Different PARTICIPANT_ID values create separate user entries in the same database. Use a new ID to start fresh without clearing the DB.
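When preparing links for several experts, building the URLs programmatically avoids typos and handles escaping. This is a minimal sketch: the helper name `build_participant_url` is hypothetical, and only the three query parameter names are taken from the table above.

```python
from urllib.parse import urlencode


def build_participant_url(base_url, participant_id, study_id=None, session_id=None):
    """Build an evaluation URL; optional parameters are omitted when not given."""
    params = {"PARTICIPANT_ID": participant_id}
    if study_id is not None:
        params["STUDY_ID"] = study_id
    if session_id is not None:
        params["SESSION_ID"] = session_id
    # urlencode escapes characters that are not safe in a query string.
    return f"{base_url.rstrip('/')}/?{urlencode(params)}"


if __name__ == "__main__":
    for expert in ("expert_1", "expert_2"):
        print(build_participant_url("http://localhost:5050", expert, study_id="pilot", session_id=1))
```

Omitted parameters fall back to the server-side defaults listed in the table.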
Since the server runs on localhost, remote participants cannot access it directly. Use ngrok to create a public tunnel:
- Install ngrok:
  brew install ngrok # macOS
- Start the tunnel (while the server is running):
  ngrok http 8080 # if using Nginx
  ngrok http 5050 # if using Gunicorn directly
- ngrok will output a public URL like:
  Forwarding https://abc123.ngrok-free.app -> http://localhost:8080
- Send the public URL to participants:
  https://abc123.ngrok-free.app/?PARTICIPANT_ID=expert_1&STUDY_ID=pilot&SESSION_ID=1
Note: The free ngrok tier generates a new URL each time the tunnel is started. For a stable URL, use a paid plan or consider deploying to a cloud server.
- GET /status: JSON with evaluation statistics (mode, users, completions)
- GET /export: Download all evaluation results as CSV
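Both endpoints can be called from a script, for example to check progress or back up results during a study. The sketch below uses only the standard library. The function names and the default output filename are assumptions, and the JSON keys returned by /status are not documented here, so the result is returned as a plain dict without interpreting it.

```python
import json
from urllib.request import urlopen


def fetch_status(base_url="http://localhost:5050", timeout=5):
    """Return the parsed JSON body of GET /status."""
    with urlopen(f"{base_url.rstrip('/')}/status", timeout=timeout) as resp:
        return json.load(resp)


def download_export(base_url="http://localhost:5050", dest="evaluation_export.csv", timeout=30):
    """Save the CSV returned by GET /export to dest and return the path."""
    with urlopen(f"{base_url.rstrip('/')}/export", timeout=timeout) as resp:
        data = resp.read()
    with open(dest, "wb") as fh:
        fh.write(data)
    return dest
```

With the server running, `print(fetch_status())` shows the current statistics and `download_export()` writes the CSV to the working directory.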
SQLite databases are stored in instance/:
| Mode | Database file | Purpose |
|---|---|---|
| calibration | bumblebee_evaluation_calibration.db | Practice/testing (safe to delete) |
| full | bumblebee_evaluation_full.db | Real evaluation data (keep this) |
Two tables:
- insect_evaluation — all evaluation responses (blind ID, morphology scores, caste, failure modes, timing)
- evaluation_users — user tracking and subset assignment
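For a quick look at how much data a mode has collected, the databases can be opened directly with Python's sqlite3 module. This is a minimal sketch: the helper names are hypothetical, the file names and table names are the ones listed above, and no column names are assumed, so it only counts rows.

```python
import sqlite3
from pathlib import Path

TABLES = ("insect_evaluation", "evaluation_users")


def db_path_for_mode(mode, instance_dir="instance"):
    """Map a mode name to its database file, following the table above."""
    if mode not in ("calibration", "full"):
        raise ValueError(f"unknown mode: {mode!r}")
    return Path(instance_dir) / f"bumblebee_evaluation_{mode}.db"


def table_row_counts(db_path):
    """Return {table_name: row_count} for the two evaluation tables."""
    db_path = Path(db_path)
    if not db_path.exists():
        # sqlite3.connect would silently create an empty file otherwise.
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(str(db_path))
    try:
        return {
            table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
            for table in TABLES
        }
    finally:
        conn.close()
```

For example, `table_row_counts(db_path_for_mode("calibration"))` reports the row counts for the practice database when run from the server directory.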
| Stage | Question | Format |
|---|---|---|
| 1 | Species identification (blind) | Family → Genus → Species dropdowns (includes "No match") |
| 2 | Morphological fidelity | 1–5 slider per feature (legs, wings, head, abdomen, thorax) |
| 2 | Diagnostic completeness | Single-select (not identifiable / family / genus / species) |
| 2 | Caste identification (blind) | Dropdown (worker / queen / male / female / uncertain) |
| 2 | Failure modes: Species fidelity | Multi-select checkboxes + "Other" free text |
| 2 | Failure modes: Image quality | Multi-select checkboxes + "Other" free text |
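The Stage 2 formats in the table translate naturally into a few validation rules. The sketch below is illustrative only and does not reflect how app.py stores or checks responses: the function name, argument names, and error messages are assumptions, while the feature list, score range, and option values are taken from the table.

```python
FEATURES = ("legs", "wings", "head", "abdomen", "thorax")
DIAGNOSTIC_LEVELS = ("not identifiable", "family", "genus", "species")
CASTES = ("worker", "queen", "male", "female", "uncertain")


def validate_stage2(morphology, diagnostic_level, caste):
    """Return a list of error strings; an empty list means the response is valid."""
    errors = []
    for feature in FEATURES:
        score = morphology.get(feature)
        # bool is a subclass of int, so exclude it explicitly.
        is_int = isinstance(score, int) and not isinstance(score, bool)
        if not is_int or not 1 <= score <= 5:
            errors.append(f"{feature}: expected an integer from 1 to 5, got {score!r}")
    if diagnostic_level not in DIAGNOSTIC_LEVELS:
        errors.append(f"diagnostic_level: unexpected value {diagnostic_level!r}")
    if caste not in CASTES:
        errors.append(f"caste: unexpected value {caste!r}")
    return errors
```

A check like this could be applied to rows from the CSV export before analysis, once the actual column names are known.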