
Script to generate and validate goldens for an import#1905

Open
ajaits wants to merge 5 commits into datacommonsorg:master from ajaits:dc-goldens

Conversation

Contributor

@ajaits ajaits commented Mar 9, 2026

Adding support for comparing output files against goldens.

Goldens may contain a subset of the output's columns. The validate_goldens check verifies that the output contains every column-value combination present in the golden file.

Expected usage:

  1. Generate a set of golden output.csv files that include all important statvars and places using data_sampler.py (more in PR#1916), placed in a folder called golden_data.
  2. Enable golden validations in the per-import validation config, specifying the golden file to compare against for each output in the manifest.json.
  3. Relax the deletion threshold to allow higher deletions overall, while ensuring goldens for prominent places and statvars have no deletions.
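The column-subset matching described above can be sketched as follows. This is a minimal illustration of the idea, not the actual validator_goldens.py implementation; the function name and row representation here are assumptions.

```python
from typing import Iterable


def validate_goldens(golden_rows: Iterable[dict], output_rows: list) -> list:
    """Return golden rows whose column combinations are missing from the output.

    A golden row may hold only a subset of the output columns; it matches when
    some output row agrees on every column the golden row defines.
    """
    missing = []
    for golden in golden_rows:
        if not any(
                all(out.get(col) == val for col, val in golden.items())
                for out in output_rows):
            missing.append(golden)
    return missing
```

Because only the golden row's own columns are compared, extra columns in the output (such as the observation value) do not affect matching.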

For more details, please refer to the design doc.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust new framework for validating data imports against predefined "golden" datasets. It provides utilities to both generate these golden sets from existing data, with advanced sampling and inclusion rules, and to perform comparisons to identify discrepancies. Additionally, it includes minor but important fixes to data processing logic, ensuring accurate handling of zero-value evaluations and proper CSV data interpretation during node loading.

Highlights

  • New Golden Validation Tool: Introduced validator_goldens.py to compare input data against expected "golden" sets, supporting flexible matching and normalization.
  • Golden Generation Capabilities: The new tool can generate golden sets from input data, including options for sampling, exhaustive sampling, and ensuring specific "must-include" values are present.
  • Enhanced Data Sampler: The data_sampler.py utility was updated to support "must-include" values and an "exhaustive" sampling mode, improving its utility for golden generation.
  • Improved Evaluation Handling: Fixed a bug in property_value_mapper.py to correctly process evaluation results that are numerically zero.
  • CSV Loading Refinement: Adjusted file_util.py to correctly load CSV data into dictionaries when using index-based keys, preventing unintended key-value parsing.
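The evaluation fix noted above follows a well-known pitfall: a truthiness check discards a legitimate numeric zero. A minimal sketch of the corrected pattern (the function name and signature here are hypothetical, not the property_value_mapper.py API):

```python
def apply_eval(value, evaluate):
    """Apply an evaluation function, keeping numeric zero as a valid result.

    The buggy form is `if result:`, which discards 0 along with None and ''.
    Checking explicitly for None and the empty string preserves zero.
    """
    result = evaluate(value)
    if result is not None and result != '':
        return result
    # No usable result: fall back to the original value.
    return value
```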
Changelog
  • tools/import_validation/validator_goldens.py
    • Added a new script for validating data against golden sets.
    • Implemented functions for generating golden sets from input data with various sampling and filtering options.
    • Provided functionality to compare input nodes against golden nodes, identifying missing entries.
    • Included utility functions for node fingerprinting and value normalization.
  • tools/import_validation/validator_goldens_test.py
    • Added comprehensive unit tests for the validator_goldens.py script.
    • Covered tests for configuration loading, value normalization, node fingerprinting, node comparison, and golden generation/validation workflows.
  • tools/statvar_importer/data_sampler.py
    • Added new command-line flags sampler_uniques_per_column, sampler_exhaustive, and sampler_must_include.
    • Integrated mcf_file_util for namespace stripping in must-include value processing.
    • Enhanced the DataSampler class to support "must-include" values, ensuring specific data points are always sampled.
    • Introduced an "exhaustive" sampling mode to capture all unique values.
    • Updated configuration loading to handle new flags and exhaustive mode overrides.
  • tools/statvar_importer/property_value_mapper.py
    • Corrected the evaluation logic to properly handle cases where an evaluated data value is 0, ensuring it is not incorrectly treated as None or an empty string.
  • util/file_util.py
    • Modified file_load_csv_dict to prevent misinterpreting the second column as a value column when loading CSVs with key_index=True.
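The key_index behavior described in the last bullet can be sketched as below. This is an illustration of the intended semantics under stated assumptions, not the actual file_util.py code; the function name is hypothetical.

```python
import csv
import io


def load_csv_dict_by_index(text: str) -> dict:
    """Load CSV text into a dict keyed by row position.

    With positional keys there is no key column, so no column should be
    consumed as a scalar "value" column; every column stays in the row dict.
    """
    reader = csv.DictReader(io.StringIO(text))
    return {i: dict(row) for i, row in enumerate(reader)}
```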
Activity
  • No human activity has been recorded for this pull request yet.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new tool for generating and validating 'golden' files for data imports, which is a great addition for ensuring data quality. The changes also include support for 'must-include' values during sampling and a fix for handling zero as a valid evaluation result.

My review has identified a critical bug in validator_goldens.py that would cause a NameError. I've also included suggestions to improve code quality by addressing mutable default arguments, removing leftover debug code, and refactoring duplicated logic. The new tests are comprehensive, but adding a test case for the load_must_include_values function would have caught the aforementioned bug.
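The mutable-default-argument issue mentioned above follows a standard pattern. A generic sketch (the function here is illustrative, not code from this PR):

```python
def add_node(node, seen=None):
    """Accumulate nodes into a per-call list.

    Writing `def add_node(node, seen=[])` would bind one shared list at
    function definition time, so state would leak across calls. Using None
    as the sentinel and creating the list inside the body avoids that.
    """
    if seen is None:
        seen = []
    seen.append(node)
    return seen
```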

strip_namespaces=strip_namespaces)
# Initialize match count to 0.
golden_matches[key] = 0
logging.debug(f'DELETE: matching golden keys: {golden_matches.keys()}')

medium

This appears to be a leftover debug logging statement. It should be removed before merging.

