
Script to generate and validate goldens for an import#1905

Open
ajaits wants to merge 5 commits into datacommonsorg:master from ajaits:dc-goldens

Conversation

Contributor

@ajaits ajaits commented Mar 9, 2026

Adding support for comparing output files against goldens.

Goldens may contain a subset of the output's columns. The validate_goldens check verifies that the output contains every column-value combination present in the golden file.

Expected usage:

  1. Generate a set of golden output.csv files that include all important statvars and places using data_sampler.py (more in PR#1916), placed in a folder called golden_data.
  2. Enable golden validations in the per-import validation config, specifying the golden file to compare against for each output in the manifest.json.
  3. Relax the deletion threshold to allow higher deletions overall, while ensuring goldens for prominent places and statvars have no deletions.
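The column-subset matching described above can be sketched as follows. This is a minimal illustration of the idea, not the actual validator_goldens.py implementation; the function name and row representation here are assumptions.

```python
from typing import Iterable


def validate_goldens(golden_rows: Iterable[dict], output_rows: list) -> list:
    """Return golden rows whose column combinations are missing from the output.

    A golden row may hold only a subset of the output columns; it matches when
    some output row agrees on every column the golden row defines.
    """
    missing = []
    for golden in golden_rows:
        if not any(
                all(out.get(col) == val for col, val in golden.items())
                for out in output_rows):
            missing.append(golden)
    return missing
```

Because only the golden row's own columns are compared, extra columns in the output (such as the observation value) do not affect matching.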

For more details, please refer to the design doc.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust new framework for validating data imports against predefined "golden" datasets. It provides utilities to both generate these golden sets from existing data, with advanced sampling and inclusion rules, and to perform comparisons to identify discrepancies. Additionally, it includes minor but important fixes to data processing logic, ensuring accurate handling of zero-value evaluations and proper CSV data interpretation during node loading.

Highlights

  • New Golden Validation Tool: Introduced validator_goldens.py to compare input data against expected "golden" sets, supporting flexible matching and normalization.
  • Golden Generation Capabilities: The new tool can generate golden sets from input data, including options for sampling, exhaustive sampling, and ensuring specific "must-include" values are present.
  • Enhanced Data Sampler: The data_sampler.py utility was updated to support "must-include" values and an "exhaustive" sampling mode, improving its utility for golden generation.
  • Improved Evaluation Handling: Fixed a bug in property_value_mapper.py to correctly process evaluation results that are numerically zero.
  • CSV Loading Refinement: Adjusted file_util.py to correctly load CSV data into dictionaries when using index-based keys, preventing unintended key-value parsing.
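The evaluation fix noted above follows a well-known pitfall: a truthiness check discards a legitimate numeric zero. A minimal sketch of the corrected pattern (the function name and signature here are hypothetical, not the property_value_mapper.py API):

```python
def apply_eval(value, evaluate):
    """Apply an evaluation function, keeping numeric zero as a valid result.

    The buggy form is `if result:`, which discards 0 along with None and ''.
    Checking explicitly for None and the empty string preserves zero.
    """
    result = evaluate(value)
    if result is not None and result != '':
        return result
    # No usable result: fall back to the original value.
    return value
```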
Changelog
  • tools/import_validation/validator_goldens.py
    • Added a new script for validating data against golden sets.
    • Implemented functions for generating golden sets from input data with various sampling and filtering options.
    • Provided functionality to compare input nodes against golden nodes, identifying missing entries.
    • Included utility functions for node fingerprinting and value normalization.
  • tools/import_validation/validator_goldens_test.py
    • Added comprehensive unit tests for the validator_goldens.py script.
    • Covered tests for configuration loading, value normalization, node fingerprinting, node comparison, and golden generation/validation workflows.
  • tools/statvar_importer/data_sampler.py
    • Added new command-line flags sampler_uniques_per_column, sampler_exhaustive, and sampler_must_include.
    • Integrated mcf_file_util for namespace stripping in must-include value processing.
    • Enhanced the DataSampler class to support "must-include" values, ensuring specific data points are always sampled.
    • Introduced an "exhaustive" sampling mode to capture all unique values.
    • Updated configuration loading to handle new flags and exhaustive mode overrides.
  • tools/statvar_importer/property_value_mapper.py
    • Corrected the evaluation logic to properly handle cases where an evaluated data value is 0, ensuring it is not incorrectly treated as None or an empty string.
  • util/file_util.py
    • Modified file_load_csv_dict to prevent misinterpreting the second column as a value column when loading CSVs with key_index=True.
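The key_index behavior described in the last bullet can be sketched as below. This is an illustration of the intended semantics under stated assumptions, not the actual file_util.py code; the function name is hypothetical.

```python
import csv
import io


def load_csv_dict_by_index(text: str) -> dict:
    """Load CSV text into a dict keyed by row position.

    With positional keys there is no key column, so no column should be
    consumed as a scalar "value" column; every column stays in the row dict.
    """
    reader = csv.DictReader(io.StringIO(text))
    return {i: dict(row) for i, row in enumerate(reader)}
```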
Activity
  • No human activity has been recorded for this pull request yet.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new tool for generating and validating 'golden' files for data imports, which is a great addition for ensuring data quality. The changes also include support for 'must-include' values during sampling and a fix for handling zero as a valid evaluation result.

My review has identified a critical bug in validator_goldens.py that would cause a NameError. I've also included suggestions to improve code quality by addressing mutable default arguments, removing leftover debug code, and refactoring duplicated logic. The new tests are comprehensive, but adding a test case for the load_must_include_values function would have caught the aforementioned bug.
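The mutable-default-argument issue mentioned above follows a standard pattern. A generic sketch (the function here is illustrative, not code from this PR):

```python
def add_node(node, seen=None):
    """Accumulate nodes into a per-call list.

    Writing `def add_node(node, seen=[])` would bind one shared list at
    function definition time, so state would leak across calls. Using None
    as the sentinel and creating the list inside the body avoids that.
    """
    if seen is None:
        seen = []
    seen.append(node)
    return seen
```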

strip_namespaces=strip_namespaces)
# Initialize match count to 0.
golden_matches[key] = 0
logging.debug(f'DELETE: matching golden keys: {golden_matches.keys()}')

medium

This appears to be a leftover debug logging statement. It should be removed before merging.

