A powerful Electron desktop application that processes PDF files through OCR, applies intelligent text correction, and outputs clean Markdown files with real-time processing visualization.
Advanced OCR Processing
- High-quality text extraction using Tesseract.js
- Support for multiple languages
- Confidence-based quality assessment
Intelligent Text Correction
- Automatic correction of common OCR errors
- Column layout restoration
- Spell checking with custom dictionaries
- Natural language processing for improved readability
Smart Markdown Generation
- Automatic heading detection
- List and table recognition
- Proper formatting and structure
- Metadata inclusion
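Automatic heading detection can be approximated with simple text heuristics. The sketch below is one possible approach, not the application's actual rules: the function names `looksLikeHeading` and `toMarkdown`, the 60-character limit, and the all-caps or title-case test are all assumptions made for illustration.

```javascript
// Heuristic sketch: decide whether a line of OCR text looks like a heading.
// The thresholds and rules below are illustrative assumptions.
function looksLikeHeading(line) {
  const text = line.trim();
  if (text.length === 0 || text.length > 60) return false;
  if (/[.,;:!?]$/.test(text)) return false; // headings rarely end in punctuation
  const words = text.split(/\s+/);
  const isAllCaps = text === text.toUpperCase() && /[A-Z]/.test(text);
  const isTitleCase = words.every((w) => /^[A-Z0-9]/.test(w));
  return isAllCaps || isTitleCase;
}

// Convert plain lines to Markdown, promoting detected headings to "## ".
function toMarkdown(lines) {
  return lines
    .map((line) => (looksLikeHeading(line) ? `## ${line.trim()}` : line))
    .join('\n');
}
```

A heuristic like this will miss headings that contain lowercase connecting words such as "of" or "and", so a real implementation would likely also use layout cues such as font size or spacing.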
Real-time Processing
- Live progress tracking
- Interactive text preview
- Batch processing support
- Error handling and recovery
- Node.js 18.x or higher
- npm or Yarn package manager
# Clone the repository
git clone <repository-url>
cd PDF-Processor
# Install dependencies
npm install
# Start the application in development mode
npm run dev
# Build the application
npm run build
# Create distributable packages
npm run dist
- Add PDF URLs: Enter the URLs of the PDFs you want to process
- Configure Settings: Adjust OCR language, correction settings, and output format
- Select Output Folder: Choose where to save the processed files
- Start Processing: Click the start button to begin processing
- Monitor Progress: Watch real-time progress and preview results
- Access Results: Find your processed Markdown files in the output folder
- Use "Batch Input" to add multiple URLs at once
- Load URLs from text files
- Process multiple PDFs simultaneously
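Loading URLs from a text file and processing several PDFs at once could look roughly like the sketch below. It is an illustration under stated assumptions, not the application's code: `parseUrlList` and `mapWithLimit` are hypothetical helper names, and treating lines that start with `#` as comments is an assumed convention.

```javascript
// Parse a block of text (e.g. the contents of a URL list file) into unique URLs.
// Blank lines and lines starting with "#" are skipped; the "#" comment
// convention is an assumption, not a documented feature.
function parseUrlList(text) {
  const urls = new Set();
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (line === '' || line.startsWith('#')) continue;
    try {
      const url = new URL(line);
      if (url.protocol === 'http:' || url.protocol === 'https:') urls.add(url.href);
    } catch {
      // Not a valid URL; ignore it in this sketch.
    }
  }
  return [...urls];
}

// Run `worker` over `items` with at most `limit` tasks in flight at once.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function run() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, run);
  await Promise.all(runners);
  return results;
}
```

Capping the number of simultaneous tasks keeps memory use predictable, since OCR on high-resolution page images is memory-intensive.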
OCR Settings
- Language: Choose OCR language (English, Spanish, French, German)
- Confidence Threshold: Set minimum confidence for text extraction
- Layout Preservation: Maintain original document structure
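A confidence threshold can be applied as a simple filter over recognized words. The sketch below assumes word objects shaped like `{ text, confidence }` with confidence on a 0 to 100 scale, similar to what Tesseract.js reports; `filterByConfidence` and the default of 60 are hypothetical choices for illustration.

```javascript
// Keep only words whose confidence meets the threshold, and report how much
// of the input survived. The input shape ({ text, confidence }) is an assumption.
function filterByConfidence(words, threshold = 60) {
  const kept = words.filter((w) => w.confidence >= threshold);
  const meanConfidence =
    words.length === 0
      ? 0
      : words.reduce((sum, w) => sum + w.confidence, 0) / words.length;
  return {
    text: kept.map((w) => w.text).join(' '),
    keptRatio: words.length === 0 ? 0 : kept.length / words.length,
    meanConfidence,
  };
}
```

A low `keptRatio` or `meanConfidence` is a useful signal that a page may need a higher rendering resolution or a different OCR language.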
Text Correction
- Spell Checking: Enable/disable automatic spell correction
- Aggressive Correction: More extensive error correction
- Custom Dictionary: Add domain-specific terms
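One way a custom dictionary can guide correction is to accept a character substitution only when it produces a known word. The sketch below illustrates that idea; the substitution table and the names `correctWord` and `correctText` are assumptions, not the application's real rule set.

```javascript
// Common OCR confusions, tried one pair at a time (illustrative assumptions).
const SUBSTITUTIONS = [
  ['0', 'o'],
  ['1', 'l'],
  ['5', 's'],
  ['rn', 'm'],
  ['vv', 'w'],
];

// Return a corrected word if a single substitution yields a dictionary entry.
// `dictionary` is a Set of lowercase terms; corrected words come back lowercase.
function correctWord(word, dictionary) {
  const lower = word.toLowerCase();
  if (dictionary.has(lower)) return word;
  for (const [wrong, right] of SUBSTITUTIONS) {
    if (!lower.includes(wrong)) continue;
    const candidate = lower.split(wrong).join(right);
    if (dictionary.has(candidate)) return candidate;
  }
  return word; // leave unknown words untouched rather than guess
}

// Apply correctWord to every alphanumeric token, preserving punctuation.
function correctText(text, dictionary) {
  return text.replace(/[A-Za-z0-9]+/g, (w) => correctWord(w, dictionary));
}
```

Adding domain-specific terms to the dictionary Set protects them from being "corrected", which is the main benefit of a custom dictionary in this scheme.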
Output Settings
- Format: Markdown, Plain Text, or HTML
- Metadata: Include processing information
- Formatting: Preserve original text formatting
- Main.js: Application lifecycle and window management
- PDF Processor: Handles PDF download and conversion
- OCR Worker: Manages Tesseract.js OCR processing
- Text Corrector: Applies intelligent text corrections
- App.js: Main application logic and UI management
- Components: Modular UI components for different features
- Styles: CSS for modern, responsive interface
- PDF URL validation and download
- PDF to high-resolution image conversion
- OCR text extraction with confidence scoring
- Multi-column layout restoration
- OCR error pattern correction
- Spell checking and grammar improvement
- Markdown formatting and structure detection
- Output file generation and metadata inclusion
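The steps above form a sequential pipeline, which can be modeled as a list of named async stages that each transform a shared context object. The sketch below is an assumed structure for illustration: `runPipeline`, the stage names, and the placeholder stage bodies are hypothetical and stand in for the real download, conversion, and OCR work.

```javascript
// Run stages in order. Each stage takes a context object and returns an
// updated copy; onProgress is called after every completed stage.
async function runPipeline(stages, initial, onProgress = () => {}) {
  let ctx = initial;
  for (let i = 0; i < stages.length; i++) {
    const { name, run } = stages[i];
    try {
      ctx = await run(ctx);
    } catch (err) {
      throw new Error(`Stage "${name}" failed: ${err.message}`, { cause: err });
    }
    onProgress({ stage: name, completed: i + 1, total: stages.length });
  }
  return ctx;
}

// Hypothetical stages mirroring some of the steps listed above.
const exampleStages = [
  { name: 'validate', run: async (ctx) => ({ ...ctx, url: new URL(ctx.url).href }) },
  { name: 'ocr', run: async (ctx) => ({ ...ctx, text: 'Samp1e text' }) }, // placeholder OCR output
  { name: 'correct', run: async (ctx) => ({ ...ctx, text: ctx.text.replace('1', 'l') }) },
  { name: 'markdown', run: async (ctx) => ({ ...ctx, markdown: `# Document\n\n${ctx.text}` }) },
];
```

Naming each stage makes it straightforward to report progress to the UI and to say which step failed when an error occurs.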
PDF-Processor/
├── src/
│   ├── main/            # Main process code
│   │   ├── main.js      # Entry point
│   │   └── workers/     # Background processing
│   ├── renderer/        # UI and frontend
│   │   ├── index.html   # Main interface
│   │   ├── js/          # JavaScript modules
│   │   └── styles/      # CSS styling
│   └── utils/           # Shared utilities
├── tests/               # Test files
├── build/               # Build configuration
└── dist/                # Distribution files
npm run dev # Development mode with hot reload
npm test # Run test suite
npm run build # Build for production
npm run dist # Create distributable packages
npm run lint # Code linting
npm start # Start built application
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
# Run all tests
npm test
# Run specific test suites
npm test -- --grep "OCR"
npm test -- --grep "TextCorrection"
- NODE_ENV: Development or production mode
- LOG_LEVEL: Logging verbosity (debug, info, warn, error)
Settings are automatically saved to the user's application data directory using electron-store.
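The NODE_ENV and LOG_LEVEL variables can be read with validation and sensible fallbacks. The sketch below is an assumption about how that might be done: `readEnvConfig` is a hypothetical helper, and the defaults (production mode, `info` logging, `debug` in development) are illustrative choices rather than documented behavior.

```javascript
const LOG_LEVELS = ['debug', 'info', 'warn', 'error'];

// Read NODE_ENV and LOG_LEVEL from an environment object, with defaults.
// The defaults chosen here are assumptions.
function readEnvConfig(env = process.env) {
  const mode = env.NODE_ENV === 'development' ? 'development' : 'production';
  const requested = (env.LOG_LEVEL || '').toLowerCase();
  const fallback = mode === 'development' ? 'debug' : 'info';
  const logLevel = LOG_LEVELS.includes(requested) ? requested : fallback;
  return { mode, logLevel };
}
```

Passing the environment as a parameter keeps the function easy to test without modifying `process.env`.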
OCR Not Working
- Ensure Tesseract.js dependencies are properly installed
- Check network connectivity for language data downloads
- Verify PDF image quality and resolution
Processing Errors
- Check PDF URL accessibility
- Ensure sufficient disk space for temporary files
- Verify output folder write permissions
Performance Issues
- Reduce batch size for large PDFs
- Lower OCR quality settings if needed
- Close other memory-intensive applications
Application logs are available in:
- Windows: %APPDATA%/pdf-processor/logs/
- macOS: ~/Library/Logs/pdf-processor/
- Linux: ~/.config/pdf-processor/logs/
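The per-platform log locations above can be resolved programmatically. The following sketch assumes a hypothetical `resolveLogDir` helper and that `APPDATA` is set on Windows; it mirrors the listed paths rather than describing how the application itself locates its logs.

```javascript
const path = require('node:path');

// Resolve the log directory for a platform, following the locations listed above.
// `platform`, `env`, and `home` are parameters so the function is easy to test.
// Assumes env.APPDATA is defined when platform is 'win32'.
function resolveLogDir(platform, env, home) {
  const appName = 'pdf-processor';
  switch (platform) {
    case 'win32':
      return path.win32.join(env.APPDATA, appName, 'logs');
    case 'darwin':
      return path.posix.join(home, 'Library', 'Logs', appName);
    default:
      return path.posix.join(home, '.config', appName, 'logs');
  }
}

// Typical call in a Node or Electron main process:
// resolveLogDir(process.platform, process.env, require('node:os').homedir());
```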
- Operating System: Windows 10+, macOS 10.14+, or Linux (Ubuntu 18.04+)
- Memory: 4GB RAM minimum, 8GB recommended
- Disk Space: 500MB for application, additional space for processing
- Network: Internet connection for PDF downloads and OCR language data
- Standard PDF documents
- Scanned documents (images embedded in PDF)
- Multi-page documents
- Password-protected PDFs (with manual password entry)
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
- Tesseract.js for OCR capabilities
- Electron for cross-platform desktop framework
- pdf2pic for PDF to image conversion
- Compromise for natural language processing
- Natural for text processing utilities
For support, bug reports, or feature requests:
- Create an issue on GitHub
- Check the documentation
- Review troubleshooting guide above
Made with ❤️ for better document processing