This repository explores a pipeline for turning long French parliamentary audio/video sessions into structured, searchable data: transcription, diarization, speaker-attributed segments, enrichment, and eventually storage/retrieval.
- Modular pipeline stages implemented
- Multiple full session experiments conducted
- Merge logic under active experiment (prioritizing precision to eliminate hallucination)
- Pipeline validated per-stage before integration. Orchestration in progress.
The project is about reconciling several imperfect signals into a higher-confidence representation of what happened in the room, then using the cleanest verified regions to improve speaker attribution over time.
- Whisper gives the strongest text signal, but can hallucinate, repeat text, or shift boundaries.
- pyannote gives the strongest speaker-activity signal, but its turns do not align cleanly with transcription segments.
- Silero VAD gives an independent speech-presence signal that can expose silence, gaps, and suspicious transcript regions.
- librosa audio audits add low-level acoustic evidence: energy, dB, zero-crossing rate, spectral centroid, bandwidth, and flatness. These help separate clean speech from silence, applause, noise, and timestamp pathologies.
- NLP / entity extraction helps infer speaker names and turns, but has to be treated as contextual evidence rather than ground truth. The current direction is CamemBERT rather than the earlier spaCy experiments: French parliamentary phrasing is difficult, and dependency parsing alone was too brittle.
- The official parliamentary compte rendu is useful institutional reference material, but it is not a verbatim acoustic transcript.
The current direction is validation-driven: keep disagreement visible, mark uncertain regions, and only merge signals when there is enough evidence. Once a segment is sufficiently validated, it becomes useful for downstream stages: validating named speakers, confirming speaker turns against diarization, and computing cleaner voice embeddings.
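As a minimal sketch of what "keep disagreement visible" could look like in code, the example below checks how much of each transcript segment is covered by independently detected speech regions and attaches a flag instead of dropping the segment. The `Segment` class, the `low_vad_coverage` flag name, and the 0.5 threshold are illustrative assumptions, not the repository's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Segment:
    """Hypothetical transcript segment; flags accumulate, nothing is deleted."""
    start: float
    end: float
    text: str
    flags: List[str] = field(default_factory=list)


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def vad_coverage(seg: Segment, speech_regions: List[Tuple[float, float]]) -> float:
    """Fraction of the segment covered by VAD speech regions.

    Assumes speech_regions do not overlap each other.
    """
    duration = seg.end - seg.start
    if duration <= 0:
        return 0.0
    covered = sum(overlap(seg.start, seg.end, s, e) for s, e in speech_regions)
    return min(1.0, covered / duration)


def cross_check(segments: List[Segment],
                speech_regions: List[Tuple[float, float]],
                min_coverage: float = 0.5) -> List[Segment]:
    """Flag segments that the VAD signal does not support (threshold is a guess)."""
    for seg in segments:
        if vad_coverage(seg, speech_regions) < min_coverage:
            seg.flags.append("low_vad_coverage")
    return segments


segments = [Segment(0.0, 2.0, "Bonjour"), Segment(10.0, 14.0, "Merci")]
regions = [(0.0, 2.5), (13.0, 13.5)]
checked = cross_check(segments, regions)
```

Flagging rather than filtering means a reviewer can still see the suspicious segment and decide whether to reprocess it.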
The notes in this repository document several experiments:
- Whether Whisper timestamps can be used as the main merge timeline.
- How pyannote diarization boundaries differ from Whisper transcription boundaries.
- Whether Silero VAD is a better initial temporal anchor for "speech exists here".
- How Whisper behaves with different VAD settings: VAD OFF, VAD 1000 ms, and VAD 2000 ms.
- Which confidence proxies may help detect weak or hallucinated segments.
- Which librosa metrics help distinguish clean speech, silence, applause, and noisy acoustic events.
- How to flag pathological Whisper segments, such as very long timestamp spans containing only a short phrase.
- How to detect repeated hallucinations, orphan segments, multi-speaker overlaps, and silence-gap failures.
- How official parliamentary text differs from both acoustic speech and ASR output.
- How NLP can help identify speaker mentions from French parliamentary phrasing.
Lesson so far: the useful signal comes from the agreement and disagreement between components.
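Two of the heuristics listed above (a very long timestamp span holding only a short phrase, and the same text repeated several times in a row) can be sketched without any model dependency. The flag names and all thresholds below are placeholder assumptions for illustration; the journals describe the actual experiments.

```python
def chars_per_second(start, end, text):
    """Rough text density of a segment; a tiny floor avoids division by zero."""
    duration = max(end - start, 1e-6)
    return len(text.strip()) / duration


def flag_segments(segments, min_cps=2.0, min_duration=10.0, max_repeats=3):
    """Return one list of flag names per segment, in input order.

    segments: dicts with "start", "end" (seconds) and "text".
    """
    all_flags = []
    previous_text = None
    run_length = 0
    for seg in segments:
        flags = []
        duration = seg["end"] - seg["start"]
        # Long span, very little text: a typical timestamp pathology.
        if duration >= min_duration and chars_per_second(
                seg["start"], seg["end"], seg["text"]) < min_cps:
            flags.append("sparse_text_long_span")
        # Same normalized text several times in a row: possible hallucination loop.
        normalized = " ".join(seg["text"].lower().split())
        if normalized and normalized == previous_text:
            run_length += 1
        else:
            run_length = 1
        if run_length >= max_repeats:
            flags.append("repeated_text")
        previous_text = normalized
        all_flags.append(flags)
    return all_flags


example = [
    {"start": 0.0, "end": 30.0, "text": "Merci."},
    {"start": 30.0, "end": 32.0, "text": "Sous-titres"},
    {"start": 32.0, "end": 34.0, "text": "Sous-titres"},
    {"start": 34.0, "end": 36.0, "text": "sous-titres "},
]
flags = flag_segments(example)
```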
The initial merge approach treated Whisper segment timestamps as the primary timeline and projected diarization onto them. Whisper and pyannote optimize for different tasks, so their segment boundaries drift, disagree, and sometimes encode different notions of structure.
Journal 01: Merging Whisper and Pyannote Segments.
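The Whisper-first projection described above can be sketched as a maximum-overlap assignment: each transcription segment takes the diarization label that covers most of it. This is an illustrative reconstruction, not the repository's merge code. The returned `share` value is an added assumption that makes boundary drift visible instead of hiding it.

```python
def assign_speaker(seg_start, seg_end, turns):
    """Pick the diarization label with the largest overlap on a segment.

    turns: iterable of (start, end, label) tuples, e.g. labels such as
    "SPEAKER_00". Returns (label, share), where share is the fraction of
    the segment covered by that label, or (None, 0.0) when nothing overlaps.
    """
    totals = {}
    for t_start, t_end, label in turns:
        ov = max(0.0, min(seg_end, t_end) - max(seg_start, t_start))
        if ov > 0:
            totals[label] = totals.get(label, 0.0) + ov
    if not totals:
        return None, 0.0
    label = max(totals, key=totals.get)
    duration = seg_end - seg_start
    share = totals[label] / duration if duration > 0 else 0.0
    return label, share


turns = [(0.0, 4.0, "SPEAKER_00"), (4.0, 10.0, "SPEAKER_01")]
label, share = assign_speaker(3.0, 8.0, turns)       # segment straddles a turn change
orphan = assign_speaker(20.0, 22.0, turns)           # no diarization support at all
```

A low `share` or a `None` label is exactly the kind of disagreement the later experiments try to surface rather than smooth over.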
Silero VAD was added as an independent speech-presence layer. On the tested session, it detected the beginning of speech within roughly half a second of the official session start and produced more granular, human-plausible speech regions than the earlier pyannote-derived VAD approach.
Journal 04: New Transcription Run
Journal 05: Whisper VAD Value
Instead of relying only on model outputs, the pipeline now also measures acoustic properties over time.
Journal 06: Librosa Audio Audit
Journal 07: Heuristic Flags For Whisper Segments
Journal 08: Flag audits
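To illustrate what measuring acoustic properties over time means, here is a dependency-free sketch that computes two of the simpler metrics, frame-level RMS level in dB and zero-crossing rate. It is a stand-in for the idea only: the actual audit uses librosa, and the function name, frame sizes, and output layout here are assumptions.

```python
import math


def frame_metrics(samples, frame_length=400, hop_length=200):
    """Compute RMS level (dB) and zero-crossing rate for each frame.

    samples: list of floats in [-1, 1]. With 16 kHz audio, the assumed
    defaults correspond to 25 ms frames and a 12.5 ms hop.
    """
    metrics = []
    last_start = max(len(samples) - frame_length, 0)
    for start in range(0, last_start + 1, hop_length):
        frame = samples[start:start + frame_length]
        if not frame:
            break
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        rms_db = 20 * math.log10(max(rms, 1e-10))  # floor keeps silence finite
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        metrics.append({
            "start_sample": start,
            "rms_db": rms_db,
            "zcr": crossings / len(frame),
        })
    return metrics


# Synthetic check: 400 samples of silence followed by a 440 Hz tone at 16 kHz.
silence = [0.0] * 400
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(400)]
audit = frame_metrics(silence + tone)
```

Frames with very low level and near-zero crossing rate are candidates for silence, which is useful evidence when a transcript claims speech there.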
The official parliamentary compte rendu is not a strict ground-truth transcript. It is an institutional record: edited, normalized, occasionally inconsistent, and sometimes non-verbatim.
The notes separate at least three kinds of truth:
- acoustic truth: what was physically spoken
- speaker truth: who spoke and when
- institutional truth: what the official record preserves
Journal 02: Merging With New Input
Official human reference
Merged Whisper/human comparison artifact
The older notes explore using French NLP to identify relevant person mentions and infer turn-taking. A plain PER named-entity tag is not enough. The useful signal comes from the surrounding parliamentary context: formulas such as "la parole est à", "va être posée par", speaker announcements, replies, and turn transitions.
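As a rough illustration of why context matters more than a bare PER tag, the sketch below looks for a name only when it follows one of the announcement formulas. It is a plain regular expression, not the spaCy or CamemBERT approach discussed here. The pattern, the function name, and the example sentence are assumptions made for the example. It tolerates missing accents because ASR output does not always include them.

```python
import re

# Assumed pattern: an announcement formula, a civility title, then capitalized names.
TURN_CUE = re.compile(
    r"(?:[Ll]a parole est [àa]|va [êe]tre pos[ée]e par)\s+"
    r"(?P<title>M\.|Mme)\s+"
    r"(?P<name>[A-ZÀ-Ý][\w'-]+(?:\s+[A-ZÀ-Ý][\w'-]+)*)"
)


def find_speaker_cues(text):
    """Return announced speakers with their character offset in the text."""
    return [
        {"title": m.group("title"), "name": m.group("name"), "offset": m.start()}
        for m in TURN_CUE.finditer(text)
    ]


sample = (
    "La parole est à M. Jean Dupont. Merci, madame la présidente. "
    "La question va être posée par Mme Claire Martin, pour le groupe."
)
cues = find_speaker_cues(sample)
```

A pattern like this misses role-based announcements such as "la parole est à Mme la ministre", which is one reason a rule-only approach stays brittle.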
The first version used spaCy and dependency-pattern scoring. The current direction is to use CamemBERT for stronger French language understanding, then validate inferred speakers against the temporal evidence:
- identify candidate PER mentions linked to speaking turns
- assign likely speakers through a forward sweep over the transcript
- run a backward validation sweep against pyannote speaker turns
- keep only high-confidence speaker-attributed regions as verified segments
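One possible reading of the forward and backward sweeps listed above is sketched here. It assumes each segment already carries a dominant diarization label (`diar_label`) and, where a speaker announcement was detected, an `announces` field. Those field names, the majority-vote rule, and the 0.8 threshold are all illustrative assumptions, not the project's settled design.

```python
from collections import Counter, defaultdict


def forward_sweep(segments):
    """Carry the most recently announced name forward over later segments."""
    current = None
    swept = []
    for seg in segments:
        # The announcing segment itself belongs to whoever spoke before.
        swept.append({**seg, "speaker": current})
        if seg.get("announces"):
            current = seg["announces"]
    return swept


def backward_validate(segments, min_share=0.8):
    """Walk back over the swept segments and keep only consistent ones.

    A name is trusted only if one diarization label carries most of the
    time attributed to it; segments with another label stay unverified.
    """
    votes = defaultdict(Counter)
    for seg in segments:
        if seg["speaker"] and seg["diar_label"]:
            votes[seg["speaker"]][seg["diar_label"]] += seg["end"] - seg["start"]
    checked = []
    for seg in reversed(segments):
        name = seg["speaker"]
        verified = False
        if name and votes[name]:
            label, weight = votes[name].most_common(1)[0]
            share = weight / sum(votes[name].values())
            verified = share >= min_share and seg["diar_label"] == label
        checked.append({**seg, "verified": verified})
    checked.reverse()
    return checked


session = [
    {"start": 0, "end": 5, "diar_label": "SPEAKER_00", "announces": "Jean Dupont"},
    {"start": 5, "end": 20, "diar_label": "SPEAKER_01"},
    {"start": 20, "end": 22, "diar_label": "SPEAKER_00"},  # brief interruption
    {"start": 22, "end": 40, "diar_label": "SPEAKER_01"},
]
swept = forward_sweep(session)
checked = backward_validate(swept)
```

In this toy session the interruption at 20 to 22 seconds inherits the announced name from the forward sweep but fails validation, so it stays out of the verified set.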
Those verified segments then become training/evaluation material for voice identity. By recomputing pyannote speaker centroids from cleaner segments, the pipeline should produce more stable voice fingerprints for known parliamentary speakers. Those saved centroids can later be used to recognize the same speaker across future Assembly sessions.
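The centroid idea can be sketched with plain lists: average the embeddings from a speaker's verified segments, then compare a new embedding against the saved centroids by cosine similarity. Real speaker embeddings are high-dimensional vectors produced by a model. The 3-dimensional toy vectors, the function names, and the 0.7 acceptance threshold below are assumptions for illustration only.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def centroid(embeddings):
    """Unit-length mean of unit-normalized embeddings for one speaker."""
    unit = [normalize(e) for e in embeddings]
    dim = len(unit[0])
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return normalize(mean)


def match_speaker(embedding, centroids, min_similarity=0.7):
    """Return (name, similarity); name is None when no centroid is close enough."""
    query = normalize(embedding)
    best_name, best_score = None, -1.0
    for name, center in centroids.items():
        score = sum(q * c for q, c in zip(query, center))  # cosine of unit vectors
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_similarity:
        return None, best_score
    return best_name, best_score


centroids = {
    "Jean Dupont": centroid([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]),
    "Claire Martin": centroid([[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]),
}
known, known_score = match_speaker([0.95, 0.05, 0.0], centroids)
unknown, unknown_score = match_speaker([0.0, 0.0, 1.0], centroids)
```

Returning `None` below the threshold keeps an unrecognized voice unlabeled rather than forcing it onto the nearest known speaker, in line with the precision-first stance of the project.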
See Step 5: Identifying Speakers.
The plots under docs/plots visualize signal comparisons and help explain why the pipeline now treats agreement/disagreement as data.
This plot compares VAD and diarization activity over time:
The intended pipeline shape is:
audio/video
-> audio extraction / normalization
-> Silero VAD speech mask
-> pyannote diarization
-> Whisper transcription
-> librosa audio audit
-> temporal cross-check
-> segment-level quality flags
-> merge with uncertainty flags
-> CamemBERT speaker/entity enrichment
-> forward speaker sweep
-> backward pyannote turn validation
-> verified clean speaker segments
-> speaker centroid / voice fingerprint refinement
-> structured storage
-> search / retrieval
-> cross-session speaker matching
The project currently prioritizes precision and explainability over forcing a complete transcript. No absolute ground truth exists; the only possible verification is manual listening.
Missing or uncertain regions should remain visible so they can be reviewed, filtered, or reprocessed.
