Fake Document Generator

This project is a Node.js and TypeScript tool that generates realistic fake documents for benchmarking and testing AI document processing systems.

It is designed to be extensible, so new document types and formats can be added over time.

Current Features

Change Order Log Generation: The primary module generates highly varied and realistic Change Order Logs (COR logs) modeled on real-world samples.
Multiple Formats: Outputs documents in five formats: PDF, PNG, CSV, XLSX, and a "golden data" JSON.
Golden Data JSON: For each generated document, a corresponding JSON file is created with the headers and rows of the table data. This is ideal for validating LLM parsing and extraction.
High-Fidelity Rendering: Uses a headless browser (Playwright) to render HTML templates into pixel-perfect PDFs and images, allowing for complex layouts and styling.
Dynamic & Randomized Data: Each run produces a unique set of documents with a random combination of columns, header names, and data points, ensuring a wide variety of test cases.
Realistic Scenarios: The data generation logic includes rules to create more realistic documents, such as:
- Optional, multi-line page headers that may or may not precede the main table.
- Consistent pcoNum formatting within a single document.
- Randomly skipped pcoNum values to simulate real-world logs.
- Revision numbers incorporated directly into the pcoNum (e.g., CO 1.1).
- Data consistency between status and date columns.
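
The pcoNum rules above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual shaping code: generatePcoNums, PcoNumOptions, the PREFIXES list, and the seeded mulberry32 helper are all hypothetical names, and the probabilities are arbitrary.

```typescript
// Small seeded PRNG (mulberry32) so a run can be reproduced.
// Assumption: the real project may rely on @faker-js/faker for randomness.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical options; the names are not taken from the project.
interface PcoNumOptions {
  rowCount: number;
  skipProbability: number; // chance that one sequence number is skipped
  revisionProbability: number; // chance that a row revises the previous number
}

// Hypothetical prefix styles; one is chosen per document.
const PREFIXES = ["CO ", "PCO-", "COR #"];

function generatePcoNums(rng: () => number, opts: PcoNumOptions): string[] {
  // One prefix per document keeps the formatting consistent.
  const prefix = PREFIXES[Math.floor(rng() * PREFIXES.length)];
  const result: string[] = [];
  let base = 0;
  let revision = 0;
  while (result.length < opts.rowCount) {
    const isRevision = base > 0 && rng() < opts.revisionProbability;
    if (isRevision) {
      // Revisions are folded into the number itself, e.g. "CO 1.1".
      revision += 1;
    } else {
      base += 1;
      // Occasionally skip a number to mimic gaps in real logs.
      if (rng() < opts.skipProbability) base += 1;
      revision = 0;
    }
    result.push(revision === 0 ? `${prefix}${base}` : `${prefix}${base}.${revision}`);
  }
  return result;
}
```

Calling generatePcoNums(mulberry32(42), { rowCount: 20, skipProbability: 0.2, revisionProbability: 0.3 }) returns 20 identifiers that share one prefix, never decrease, and are identical on every run with the same seed.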

Architecture

The generator uses a modular, four-layer architecture:

Data Generation Layer (src/data-generation): Provides raw, untyped fake data using @faker-js/faker.
Data Shaping Layer (src/data-shaping): Transforms the raw data into a structured format for a specific document type (e.g., a Change Order Log). This is where the logic for data randomization and realism resides.
HTML Templating Layer (src/html-templates): Takes the shaped data and generates a complete HTML document with embedded CSS for styling.
Document Rendering Layer (src/document-rendering):
- Uses Playwright to convert the HTML template into PDFs and PNGs.
- Uses exceljs and custom logic to create XLSX and CSV files.
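
As a rough sketch of how shaped data might flow into the templating and CSV steps, the example below uses a hypothetical ShapedTable type and hypothetical renderHtml and toCsv helpers; none of these names come from the project. The browser step is left out: with Playwright, the HTML string would typically be loaded via page.setContent and then exported with page.pdf or page.screenshot.

```typescript
// Assumed shape of the output of the data shaping layer.
interface ShapedTable {
  title: string;
  headers: string[];
  rows: string[][];
}

// Escape text so cell values cannot break the generated markup.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// HTML templating layer: shaped data in, complete HTML document out.
function renderHtml(table: ShapedTable): string {
  const head = table.headers.map((h) => `<th>${escapeHtml(h)}</th>`).join("");
  const body = table.rows
    .map((row) => `<tr>${row.map((c) => `<td>${escapeHtml(c)}</td>`).join("")}</tr>`)
    .join("\n");
  return `<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>${escapeHtml(table.title)}</title>
<style>table { border-collapse: collapse; } th, td { border: 1px solid #333; padding: 4px; }</style>
</head><body><table><thead><tr>${head}</tr></thead><tbody>
${body}
</tbody></table></body></html>`;
}

// CSV output: quote a field only when it contains a comma, quote, or line break,
// doubling any embedded quotes (RFC 4180 style).
function toCsv(table: ShapedTable): string {
  const quote = (field: string): string =>
    /[",\r\n]/.test(field) ? `"${field.replace(/"/g, '""')}"` : field;
  return [table.headers, ...table.rows]
    .map((row) => row.map(quote).join(","))
    .join("\r\n");
}
```

Keeping both functions pure (string in, string out) means they can be unit-tested without launching a browser.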

How to Use

Prerequisites

Node.js (v18 or higher recommended)
npm
Playwright browser binaries (downloaded by npx playwright install during installation)

Installation

Clone the repository.
Install the dependencies:
```
npm install
npx playwright install
```

Generating Documents

The project provides npm scripts that handle the entire generation process.

Run the generator:
```
npm run generate:cor-log
```
To generate a preformatted COR log for bulk import, run:
```
npm run generate:preformatted-cor-csv
```
Check the output:
- The generate:cor-log script clears the artefacts/cor-logs directory and then populates it with a fresh batch of documents covering 4 different row counts and 5 different file formats.
- The generate:preformatted-cor-csv script populates the artefacts/preformatted-cor-logs directory with a single CSV file containing 80 rows of data.
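
One way to use the golden data JSON when checking an extraction system is sketched below. It assumes the file parses to an object with headers and rows arrays of strings; the exact schema, the GoldenData and Mismatch types, and the compareToGolden helper are assumptions made for illustration.

```typescript
// Assumed shape of a golden data file: table headers plus rows of cell strings.
interface GoldenData {
  headers: string[];
  rows: string[][];
}

// One differing cell. Header mismatches are reported with row = -1.
interface Mismatch {
  row: number;
  column: string;
  expected: string;
  actual: string;
}

// Trim and collapse whitespace so cosmetic differences are not counted.
function normalize(value: string | undefined): string {
  return (value ?? "").trim().replace(/\s+/g, " ");
}

// Compare an extracted table against the golden data, cell by cell.
function compareToGolden(golden: GoldenData, extracted: GoldenData): Mismatch[] {
  const mismatches: Mismatch[] = [];
  golden.headers.forEach((header, col) => {
    const expected = normalize(header);
    const actual = normalize(extracted.headers[col]);
    if (expected !== actual) {
      mismatches.push({ row: -1, column: header, expected, actual });
    }
  });
  // Walk the longer of the two tables so missing or extra rows are reported too.
  const rowCount = Math.max(golden.rows.length, extracted.rows.length);
  for (let r = 0; r < rowCount; r++) {
    golden.headers.forEach((header, col) => {
      const expected = normalize(golden.rows[r]?.[col]);
      const actual = normalize(extracted.rows[r]?.[col]);
      if (expected !== actual) {
        mismatches.push({ row: r, column: header, expected, actual });
      }
    });
  }
  return mismatches;
}
```

An empty result means the extracted table matches the golden data after whitespace normalization. The golden object would typically come from JSON.parse on one of the generated JSON files, whose names are not specified here.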

Extending the Generator

To add a new document type (e.g., "Invoice"):

Add a new interface for the document data to src/types.ts.
Create a new data shaper function in src/data-shaping/generate-invoice.ts.
Create a new HTML template in src/html-templates/invoice-template.ts.
Create a new main script in src/generate-invoice.ts.
Add a new script to package.json: "generate:invoice": "npm run build && node dist/generate-invoice.js".
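
The first two steps might look roughly like the sketch below. The InvoiceData fields, the generateInvoice signature, and the injected random function are illustrative assumptions; the real shaper would presumably draw its values from the data generation layer.

```typescript
// Hypothetical additions to src/types.ts. Amounts are kept in integer cents
// to avoid floating-point rounding when totals are computed.
interface InvoiceLineItem {
  description: string;
  quantity: number;
  unitPriceCents: number;
}

interface InvoiceData {
  invoiceNumber: string;
  lineItems: InvoiceLineItem[];
  totalCents: number;
}

// Hypothetical shaper for src/data-shaping/generate-invoice.ts.
// `random` must return values in [0, 1); injecting it keeps the function testable.
function generateInvoice(random: () => number, itemCount: number): InvoiceData {
  const lineItems: InvoiceLineItem[] = Array.from({ length: itemCount }, (_, i) => ({
    description: `Line item ${i + 1}`,
    quantity: 1 + Math.floor(random() * 10),
    unitPriceCents: Math.round(random() * 100000),
  }));
  // Keep the total consistent with the line items, as a real invoice would be.
  const totalCents = lineItems.reduce(
    (sum, item) => sum + item.quantity * item.unitPriceCents,
    0,
  );
  const sequence = 1 + Math.floor(random() * 9999);
  return {
    invoiceNumber: `INV-${String(sequence).padStart(4, "0")}`,
    lineItems,
    totalCents,
  };
}
```

The template and main script from the remaining steps would then consume InvoiceData in the same way the Change Order Log pipeline consumes its own shaped data.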
