
Enhancements to data sampler #1916

Open
ajaits wants to merge 4 commits into datacommonsorg:master from ajaits:dc-sampler

@ajaits ajaits commented Mar 13, 2026

Adds the following new features for data sampler:

  • exhaustive mode: enables sampling of all unique values in selected columns
  • include values: a list of values to always include for specific columns.
    This is used to create golden outputs with selected statvars such as NL index.
  • more tests, added by gemini-cli
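
For reviewers new to the sampler, the "include values" idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name sample_rows, its parameters, and the choice not to count must-include rows against the row limit are all assumptions.

```python
import csv
import io


def sample_rows(csv_text, max_rows, include_values):
    """Hypothetical sketch: keep up to max_rows ordinary rows, plus every
    row that holds a must-include value.

    include_values maps a column name to the set of values that must
    never be dropped from the sample.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    sampled = []
    ordinary = 0  # rows kept only because the limit was not yet reached
    for row in reader:
        must_keep = any(
            row.get(column) in values
            for column, values in include_values.items())
        if must_keep:
            # Assumption: must-include rows bypass the row limit.
            sampled.append(row)
        elif ordinary < max_rows:
            sampled.append(row)
            ordinary += 1
    return sampled
```

With a limit of one row, a statvar listed in include_values is still kept even when it appears after the limit has been reached, which is the property a golden output relies on.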

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the data sampling utility by introducing more flexible and powerful control mechanisms. It allows for comprehensive data capture through an exhaustive mode, ensures the inclusion of specific critical values, and provides finer control over the number of unique values sampled per column. These additions aim to make the data sampler more robust and adaptable to diverse data processing requirements, particularly for scenarios where complete or targeted data representation is crucial.

Highlights

  • Exhaustive Sampling Mode: Introduced a new sampler_exhaustive flag that, when enabled, automatically configures the data sampler to capture all unique values by setting sampler_output_rows and sampler_uniques_per_column to infinity and sampler_rows_per_key to 1.
  • Mandatory Value Inclusion: Added a sampler_column_keys flag and corresponding logic to allow users to specify a list of 'column:file' pairs. Values from the specified files will always be included in the sample if they appear in the input data, ensuring critical data points are not missed during sampling.
  • Configurable Unique Values Per Column: Implemented a new sampler_uniques_per_column flag to control the maximum number of unique values tracked per column, offering more granular control over the sampling process. Setting this to 0 or -1 tracks all unique values.
  • Enhanced Test Coverage: Significantly expanded the test suite for the data sampler, including new tests for exhaustive_mode, must_include_values, uniques_per_column, and load_column_keys, and unskipped previously pending tests for rows_per_key and cell_value_regex with updated implementations.
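
The override behavior described in the highlights above can be sketched as below. The flag names come from the summary; the helper functions apply_exhaustive_overrides and effective_unique_limit, and the use of a plain dict for the config, are assumptions made for illustration and may not match how data_sampler.py is structured.

```python
import math


def apply_exhaustive_overrides(config):
    """Hypothetical helper: return a copy of config with the exhaustive
    mode overrides applied when sampler_exhaustive is set."""
    resolved = dict(config)
    if resolved.get("sampler_exhaustive"):
        # No cap on output rows or on unique values tracked per column.
        resolved["sampler_output_rows"] = math.inf
        resolved["sampler_uniques_per_column"] = math.inf
        # One row per key is enough once every unique value is captured.
        resolved["sampler_rows_per_key"] = 1
    return resolved


def effective_unique_limit(uniques_per_column):
    """Hypothetical helper: treat 0 or -1 as 'track all unique values'."""
    if uniques_per_column in (0, -1):
        return math.inf
    return uniques_per_column


config = apply_exhaustive_overrides({
    "sampler_exhaustive": True,
    "sampler_output_rows": 100,
    "sampler_rows_per_key": 5,
})
```

Representing "no limit" as math.inf keeps comparisons such as count < limit valid without a separate special case for unlimited sampling.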

Changelog

  • tools/statvar_importer/data_sampler.py
    • Added new flags: sampler_uniques_per_column, sampler_exhaustive, and sampler_column_keys.
    • Imported mcf_file_util for namespace stripping in value comparisons.
    • Updated DataSampler.__init__ to accept column_include_values for programmatic inclusion.
    • Modified DataSampler.reset to apply sampler_exhaustive overrides and initialize _must_include_values and _must_include_indices.
    • Refactored column tracking logic by renaming _should_track_column to _is_unique_column and introducing a new _should_track_column for broader tracking.
    • Enhanced _process_header_row to map must-include values to their respective column indices.
    • Implemented a new _is_must_include method to efficiently check for values that must be included.
    • Adjusted select_row to handle infinite limits for max_count and max_uniques_per_col and incorporated logic to prioritize must-include values.
    • Added a new load_column_keys function to parse and load column-specific inclusion lists from files.
    • Updated the docstring for sample_csv_file to reflect the new sampler_exhaustive parameter.
    • Modified get_default_config to include the new flags and apply exhaustive mode overrides to the default configuration.
  • tools/statvar_importer/data_sampler_test.py
    • Unskipped and updated test_rows_per_key with a controlled input to verify sampler_rows_per_key functionality.
    • Unskipped and updated test_cell_value_regex with specific test data to confirm regex-based filtering.
    • Added test_exhaustive_mode to validate the behavior of the new exhaustive sampling feature.
    • Introduced test_must_include_values to ensure that specified values are always included in the sample.
    • Added test_uniques_per_column to verify the new limit on unique values per column.
    • Created test_load_column_keys to confirm the correct parsing and loading of column key files.
    • Added test_get_default_config to ensure the default configuration is correctly initialized with new flags.
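
As a rough picture of what parsing 'column:file' pairs might involve, here is a minimal sketch. It is not the load_column_keys added in this PR, whose signature and file format are not shown in this thread. The names parse_column_key_specs, load_include_values, and strip_namespace are hypothetical, the one-value-per-line file format is an assumption, and strip_namespace is only a simplified stand-in for whatever mcf_file_util provides.

```python
def strip_namespace(value):
    """Hypothetical stand-in: drop a leading 'prefix:' such as 'dcid:'.

    Simplification: any text before the first colon is treated as a
    namespace, so values that legitimately contain ':' would be altered.
    """
    return value.split(":", 1)[-1].strip()


def parse_column_key_specs(specs):
    """Turn ['column:file', ...] into {column: file_path}."""
    mapping = {}
    for spec in specs:
        # Split on the first colon only, so the path may contain colons.
        column, sep, path = spec.partition(":")
        if not sep or not column.strip() or not path.strip():
            raise ValueError(f"expected 'column:file', got {spec!r}")
        mapping[column.strip()] = path.strip()
    return mapping


def load_include_values(specs):
    """Read each file (assumed: one value per line) into a set per column."""
    values = {}
    for column, path in parse_column_key_specs(specs).items():
        with open(path, encoding="utf-8") as handle:
            values[column] = {
                strip_namespace(line) for line in handle if line.strip()
            }
    return values
```

Stripping the namespace on load means a file entry like dcid:Count_Person would match an input cell containing Count_Person, provided the same normalization is applied to the cell value before the lookup.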

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces significant enhancements to the data sampler, adding an 'exhaustive' mode to capture all unique values and a mechanism to enforce the inclusion of specific values. The changes are well-implemented and include comprehensive tests for the new functionality. I've provided a couple of suggestions to improve code clarity and remove duplicated logic.
