Password-list analysis & Hashcat artifact generator
This repository contains extractor.py — a robust Python utility to analyze password lists, infer common password transforms, and export Hashcat-friendly artifacts such as prioritized candidate lists, masks (.hcmask), suffix lists, and .rule files.
The script supports two dictionary substring engines:
- pyahocorasick (recommended): a fast C-based Aho–Corasick automaton (optional dependency).
- DictTrie (built-in): a pure-Python trie fallback used when pyahocorasick is not available.
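The optional-dependency pattern described above can be sketched roughly as follows. This is an illustration only, not the script's actual code: `find_words` is a hypothetical helper, and the nested-dict trie stands in for the built-in DictTrie, whose real interface is not shown in this README.

```python
from typing import Iterable, List, Tuple

try:
    import ahocorasick  # provided by the optional pyahocorasick package
except ImportError:
    ahocorasick = None  # fall back to the pure-Python trie below


def find_words(text: str, words: Iterable[str]) -> List[Tuple[int, str]]:
    """Return sorted (start_index, word) pairs for each dictionary word in text."""
    vocab = [w for w in words if w]
    if not vocab:
        return []

    if ahocorasick is not None:
        automaton = ahocorasick.Automaton()
        for word in vocab:
            automaton.add_word(word, word)
        automaton.make_automaton()
        # iter() yields (end_index, value); convert to the start index.
        return sorted((end - len(w) + 1, w) for end, w in automaton.iter(text))

    # Fallback: a nested-dict trie; the empty-string key marks a word end.
    root: dict = {}
    for word in vocab:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = word

    hits = []
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.get(ch)
            if node is None:
                break
            if "" in node:
                hits.append((start, node[""]))
    return sorted(hits)
```

Both paths return the same matches, so the rest of a pipeline does not need to know which engine is active.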
Features:
- Analyze password lists and produce analysis.jsonl containing candidate de-leeted variants and segmentations.
- Generate a wordfreq from raw text inputs.
- Infer common transforms (suffixes, years, capitalization, substitutions) from a cracked password list.
- Export Hashcat artifacts: prioritized wordlist, masks, suffix lists, and .rule files.
- Combined workflow generate-artifacts to run export and then generate masks and rules.
- Atomic file writes, progress indicators (when tqdm is installed), and graceful SIGINT handling.
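For readers unfamiliar with atomic file writes, the usual technique is to write to a temporary file in the same directory and then rename it over the target. The sketch below shows that general pattern; `atomic_write_text` is a hypothetical name, and the script's own implementation may differ.

```python
import os
import tempfile


def atomic_write_text(path: str, text: str) -> None:
    """Write text to path so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic swap on POSIX and Windows
    except BaseException:
        # Also covers KeyboardInterrupt (SIGINT): remove the partial temp file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process is interrupted mid-write, the previous version of the target file is left untouched.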
Recommended: use a virtual environment.
python -m venv .venv
source .venv/bin/activate # or .\.venv\Scripts\activate on Windows
python -m pip install --upgrade pip
Optional (recommended for speed):
python -m pip install pyahocorasick tqdm
pyahocorasick provides the fastest dictionary substring matching; the script works without it using the built-in trie.
Save the script as extractor.py and run:
python extractor.py <command> [options]
Available commands:
- analyze: Analyze a password list and write analysis.jsonl + templates.json.
- gen-wordfreq: Generate a wordfreq.txt file from sample text.
- infer-transforms: Infer transforms from a cracked list using a dictionary.
- export: Export masks, rules, prioritized wordlist, and suffix files from analysis.jsonl and transforms.json.
- generate-artifacts: Run export and then generate masks/.rule files (combined workflow).
- gen-hcmask-rules: Generate masks.hcmask and Hashcat .rule files from analysis.jsonl.
Analyze a password list:
python extractor.py analyze --pw-list path/to/passwords.txt --dict wordfreq.txt --out-dir out/analysis --beam 500 --topk-per-pw 5
Generate artifacts (export + rules):
python extractor.py generate-artifacts --analysis out/analysis/analysis.jsonl --transforms out/transforms.json --templates out/analysis/templates.json --out-dir out/artifacts
Generate a wordfreq from a corpus:
python extractor.py gen-wordfreq --input samples/corpus.txt --out wordfreq.txt --min-token-len 2
Infer transforms from a cracked file:
python extractor.py infer-transforms --cracked cracked.txt --dict wordfreq.txt --out-dir out/transforms
Options:
- --pw-list: path to the password list for analyze.
- --dict: path to the wordfreq dictionary used for scoring/segmentation.
- --out-dir: output directory for generated artifacts.
- --topk-per-pw: how many top candidates to keep per password (default: 5).
- --beam: candidate-generation beam width (default: 500). Lowering it speeds up execution at the cost of some coverage.
- --no-progress: disable progress bars.
When running analyze and export, the following outputs are produced under --out-dir:
- analysis.jsonl: one JSON object per password with orig, template, and candidates (candidate, rank_score, segmentation).
- templates.json: aggregated template counts and frequencies.
- prioritized_wordlist.txt: deduplicated prioritized candidate list (used for targeted cracking).
- masks/ and masks.hcmask: Hashcat masks derived from templates.
- rules/: generated .rule files (capitalize, append/prepend affixes, etc.).
- suffixes/: generated suffix lists (e.g., digits, years).
- rules/all_rules.rule: combined deduplicated rule file.
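Because analysis.jsonl holds one JSON object per line, it can be inspected with a few lines of Python. The sketch below assumes that candidates is a list of objects keyed by candidate, rank_score, and segmentation, and that a higher rank_score is better; both points are assumptions about the file layout rather than documented behavior. `best_candidates` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator, Tuple


def best_candidates(lines: Iterable[str]) -> Iterator[Tuple[str, str, float]]:
    """Yield (orig, top_candidate, rank_score) for each JSONL record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        candidates = record.get("candidates") or []
        if not candidates:
            continue  # nothing was generated for this password
        # Assumption: a higher rank_score means a better candidate.
        top = max(candidates, key=lambda c: c["rank_score"])
        yield record["orig"], top["candidate"], top["rank_score"]


if __name__ == "__main__":
    # Path taken from the analyze example in this README.
    with open("out/analysis/analysis.jsonl", encoding="utf-8") as handle:
        for orig, candidate, score in best_candidates(handle):
            print(orig, candidate, score)
```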
Performance tips:
- Install pyahocorasick for much faster substring matching:
  python -m pip install pyahocorasick
- Run analyze with a reduced --beam and --topk-per-pw for large lists, e.g. --beam 200 --topk-per-pw 3.
- Use --no-progress when running in non-interactive environments.
- Consider parallelizing large analyses by splitting the input file and running multiple analyze jobs concurrently (the script is streaming-safe).
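One way to prepare a large list for concurrent analyze jobs is to split it into chunk files first. The helper below is a minimal sketch, not part of extractor.py: `split_lines` and the chunk file names are hypothetical, and it works in binary mode so that passwords in unusual encodings pass through unchanged.

```python
from contextlib import ExitStack
from pathlib import Path
from typing import List


def split_lines(src: str, out_dir: str, parts: int) -> List[Path]:
    """Distribute the lines of src round-robin across `parts` chunk files."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = [out / f"chunk_{i:03d}.txt" for i in range(parts)]
    with ExitStack() as stack:
        handles = [stack.enter_context(p.open("wb")) for p in paths]
        with open(src, "rb") as source:
            # Stream line by line so the whole list never sits in memory.
            for index, line in enumerate(source):
                handles[index % parts].write(line)
    return paths
```

Each chunk can then be passed to its own analyze run via --pw-list with a separate --out-dir. Merging the per-chunk outputs is not covered by this README and is left to the user.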
Notes:
- The script is implemented in pure Python (3.8+) and avoids non-standard dependencies except the optional ones listed above.
- Logging is available; run python extractor.py ... and inspect the log output.
Suggested test workflow:
- Build a small wordfreq.txt (or use gen-wordfreq on a small corpus).
- Create a tiny passwords.txt containing known test cases (e.g., john.doe1990, p@ssword123, adm1n).
- Run analyze and inspect analysis.jsonl to verify segmentation and candidate generation.
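When checking the output for test cases such as p@ssword123 and adm1n, it helps to know roughly what de-leeted candidates to expect. The sketch below enumerates variants from a small substitution table. `LEET_MAP` and `deleet_variants` are illustrative assumptions; the script's own table and ranking are not documented here and may differ.

```python
from itertools import product
from typing import List

# Hypothetical substitution table: leet character -> possible plain letters.
LEET_MAP = {
    "@": "a", "4": "a", "3": "e", "1": "il", "!": "i",
    "0": "o", "$": "s", "5": "s", "7": "t",
}


def deleet_variants(password: str, limit: int = 64) -> List[str]:
    """Enumerate de-leeted variants of password, original spelling first."""
    # Each position may keep its character or take one of the mapped letters.
    options = [ch + LEET_MAP.get(ch, "") for ch in password]
    variants = []
    for combo in product(*options):
        variants.append("".join(combo))
        if len(variants) >= limit:
            break  # cap the combinatorial blow-up on long inputs
    return variants


print(deleet_variants("adm1n"))  # ['adm1n', 'admin', 'admln']
```

A dictionary lookup against wordfreq.txt would then decide which variants are plausible words, which is why the dictionary contents matter for this step.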
Troubleshooting:
- If the script reports analysis.jsonl not found, ensure --out-dir is writable and the analyze step completed without interruption.
- If substring/word detection seems weak, check that the --dict (wordfreq.txt) contains relevant tokens (names, brands, etc.).
- If you see performance issues, try installing pyahocorasick and decreasing --beam.
Contributions are welcome. Suggested improvements:
- Add an optional multiprocessing/parallel mode for analyze (safe chunked processing + deterministic merge).
- Add optional integration with rapidfuzz for fuzzy substring matching.
- Provide a small test harness (unit tests) and CI configuration.
When contributing, provide tests and maintain backward-compatible CLI behavior.
This project is provided "as-is". No license is included by default — please add a LICENSE file if you intend to redistribute under a specific license.
- v1.0 — Baseline: analyze, gen-wordfreq, infer-transforms, export, gen-hcmask-rules.
- v1.x — Added pyahocorasick support (recommended) and DictTrie fallback, atomic writes, and de-leet/tokenization improvements.
For questions or requests, open an issue or contact the maintainer.