Summary
Provide a separate command-line tool for keyword extraction and term analysis using TF-IDF.
Motivation
TF-IDF is a vectorizer, not a classifier. It transforms text into weighted term vectors. Cramming it into the classifier CLI would be dishonest about what the tool does.
A dedicated keywords tool enables:
- Keyword extraction from documents
- Understanding term importance
- Building vocabularies for other tools
- Document similarity analysis
- Preprocessing pipelines
Proposed CLI
Fit (build vocabulary)
# Build vocabulary from files
keywords fit corpus/*.txt
# From stdin
cat documents.txt | keywords fit
# Custom model path
keywords fit -m vocab.json corpus/*.txt
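A minimal sketch of what `fit` might compute, assuming the model is a JSON document holding the document count and per-term document frequencies. The `tokenize` helper, the `fit` signature, and the JSON layout are all hypothetical; the proposal does not fix any of them.

```ruby
require "json"

# Hypothetical tokenizer: lowercase, keep runs of letters and digits.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# Count how many documents each term appears in (document frequency),
# then drop terms outside the min_df / max_df bounds.
def fit(documents, min_df: 1, max_df: 1.0)
  df = Hash.new(0)
  documents.each do |doc|
    tokenize(doc).uniq.each { |term| df[term] += 1 }
  end
  n = documents.size
  kept = df.select { |_, count| count >= min_df && count.to_f / n <= max_df }
  { "documents" => n, "df" => kept }
end

model = fit(["Ruby is a language", "Python is a language", "Ruby gems"])
puts JSON.pretty_generate(model)
```

Writing the result to the path given by `-m` would then be a single `File.write(path, JSON.generate(model))`.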
Transform (extract terms)
# Get weighted terms from text
keywords "Ruby is a programming language"
# => ruby:0.52 programming:0.41 language:0.38
# From stdin
echo "some document text" | keywords
# => document:0.45 text:0.42
# Top N terms only
keywords -n 5 "long document with many terms..."
# => term1:0.5 term2:0.4 term3:0.3 term4:0.2 term5:0.1
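The transform step could look roughly like the sketch below. It assumes a smoothed IDF of `ln((1 + N) / (1 + df)) + 1` and L2 normalization, which is one common convention and not something this proposal specifies. `sample_model`, `transform`, and `format_terms` are hypothetical names, and no stopword filtering is applied.

```ruby
# Hypothetical model: document count plus per-term document frequencies.
def sample_model
  { "documents" => 3,
    "df" => { "ruby" => 2, "is" => 3, "a" => 3, "language" => 2, "gems" => 1 } }
end

# Hypothetical tokenizer: lowercase, keep runs of letters and digits.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# Return [term, weight] pairs sorted by descending weight.
def transform(text, model, top: nil)
  n = model["documents"]
  weights = tokenize(text).tally.filter_map do |term, count|
    df = model["df"][term]
    next unless df # skip terms that are not in the fitted vocabulary
    idf = Math.log((1.0 + n) / (1.0 + df)) + 1.0 # assumed smoothing
    [term, count * idf]
  end
  norm = Math.sqrt(weights.sum { |_, w| w * w })
  weights = weights.map { |t, w| [t, norm.zero? ? 0.0 : w / norm] }
  ranked = weights.sort_by { |t, w| [-w, t] } # ties broken alphabetically
  top ? ranked.first(top) : ranked
end

# Render pairs in the "term:weight" format used in the examples.
def format_terms(pairs)
  pairs.map { |t, w| format("%s:%.2f", t, w) }.join(" ")
end

puts format_terms(transform("Ruby is a language", sample_model, top: 2))
```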
Extract (convenience alias)
# Extract keywords from a file
keywords extract article.txt
# => machine:0.61 learning:0.58 neural:0.45 network:0.42
# From URL (with curl)
curl -s https://example.com/article | keywords extract
Info
keywords info
# => Documents: 1,234
# => Vocabulary: 5,678
# => Min DF: 1
# => Max DF: 1.0
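As a rough illustration, `info` could be a thin formatter over the stored model. The JSON keys used here (`documents`, `df`, `min_df`, `max_df`) and the helper names are assumptions, not part of the proposal.

```ruby
require "json"

# Insert thousands separators, e.g. 1234 -> "1,234".
def with_commas(n)
  n.to_s.reverse.scan(/\d{1,3}/).join(",").reverse
end

# Build the lines printed by `keywords info` from a parsed model hash.
def info_lines(model)
  [
    "Documents: #{with_commas(model.fetch('documents'))}",
    "Vocabulary: #{with_commas(model.fetch('df').size)}",
    "Min DF: #{model.fetch('min_df', 1)}",
    "Max DF: #{model.fetch('max_df', 1.0)}"
  ]
end

model = JSON.parse('{"documents": 1234, "df": {"ruby": 2, "gems": 1}}')
puts info_lines(model)
```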
Options
-m, --model FILE Model file (default: ./keywords.json)
-n, --top N Show top N terms only
-q Quiet mode (for scripting)
-v, --version Show version
-h, --help Show help
Fit-specific options
--min-df N Minimum document frequency (default: 1)
--max-df N Maximum document frequency ratio (default: 1.0)
--ngram MIN,MAX N-gram range (default: 1,1)
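The option table above maps fairly directly onto `optparse`. The following is a sketch only: `parse_options` and the option hash keys are hypothetical, and the version and help flags are left out for brevity.

```ruby
require "optparse"

# Parse an argv array into [options, remaining arguments].
def parse_options(argv)
  options = { model: "./keywords.json", top: nil, quiet: false,
              min_df: 1, max_df: 1.0, ngram: [1, 1] }
  parser = OptionParser.new do |o|
    o.banner = "Usage: keywords [fit|extract|info] [options] [TEXT|FILE...]"
    o.on("-m", "--model FILE", "Model file") { |v| options[:model] = v }
    o.on("-n", "--top N", Integer, "Show top N terms only") { |v| options[:top] = v }
    o.on("-q", "Quiet mode (for scripting)") { options[:quiet] = true }
    o.on("--min-df N", Integer, "Minimum document frequency") { |v| options[:min_df] = v }
    o.on("--max-df N", Float, "Maximum document frequency ratio") { |v| options[:max_df] = v }
    o.on("--ngram MIN,MAX", Array, "N-gram range") do |v|
      options[:ngram] = v.map { |s| Integer(s) }
    end
  end
  # permute lets options appear before or after the subcommand.
  rest = parser.permute(argv)
  [options, rest]
end

opts, rest = parse_options(["fit", "-m", "vocab.json", "--ngram", "1,2", "a.txt"])
p opts
p rest
```

Using `permute` keeps the subcommand and file arguments in `rest`, so dispatching on `rest.first` would be one way to keep transform as the default action.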
Examples
# Build vocabulary and extract keywords
keywords fit articles/*.txt
keywords "What are the main topics?"
# => topics:0.6 main:0.4
# Top 5 terms from a file
keywords fit corpus.txt
keywords extract -n 5 article.txt
# Compare documents (output as TSV for scripting)
keywords extract -q doc1.txt > /tmp/v1.txt
keywords extract -q doc2.txt > /tmp/v2.txt
# Then use external tool for cosine similarity
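For the similarity step, a small external script might be enough. The sketch below assumes each saved file contains the `term:weight` pairs shown in the earlier examples; the exact `-q` format is not specified here, so `parse_terms` (a hypothetical helper) would need adapting to whatever is chosen.

```ruby
# Parse "term:weight term:weight ..." into a Hash of term => Float.
def parse_terms(line)
  line.split.to_h do |pair|
    term, _, weight = pair.rpartition(":")
    [term, weight.to_f]
  end
end

# Cosine similarity between two sparse term-weight hashes.
def cosine(a, b)
  dot = a.sum { |term, w| w * b.fetch(term, 0.0) }
  norm_a = Math.sqrt(a.values.sum { |w| w * w })
  norm_b = Math.sqrt(b.values.sum { |w| w * w })
  return 0.0 if norm_a.zero? || norm_b.zero?
  dot / (norm_a * norm_b)
end

v1 = parse_terms("ruby:0.52 programming:0.41 language:0.38")
v2 = parse_terms("python:0.55 programming:0.40 language:0.35")
puts cosine(v1, v2).round(3)
```

In practice the two strings would come from `File.read("/tmp/v1.txt")` and `File.read("/tmp/v2.txt")`.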
Design Principles
- Separate tool: TF-IDF is not classification, don't pretend it is
- Transform is default: No subcommand needed for primary action
- Stdin works: Pipe-friendly
- Scriptable output: -q for machine-readable format
Implementation Notes
- Use optparse (stdlib)
- Reuse existing Classifier::TFIDF class (when implemented)
- Exit codes: 0 success, 1 error, 2 usage error
- Default model: ./keywords.json
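One possible way to honor the exit codes listed above is to wrap the command body and map exceptions to codes. `exit_code_for` is a hypothetical helper, and treating every `OptionParser::ParseError` as a usage error is an assumption.

```ruby
require "optparse"

# Run the block and translate its outcome into a process exit code:
# 0 on success, 2 for option-parsing (usage) errors, 1 for anything else.
def exit_code_for
  yield
  0
rescue OptionParser::ParseError => e
  warn "usage error: #{e.message}"
  2
rescue StandardError => e
  warn "error: #{e.message}"
  1
end

puts exit_code_for { :ok }
```

The executable would then end with something like `exit exit_code_for { run(ARGV) }`, where `run` stands in for the real entry point.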
Related