Skip to content

chore: add ko_KR locale to nemotron personas datasets#572

Merged
johnnygreco merged 6 commits intomainfrom
johnny/chore/add-ko-kr-nemotron-personas-locale
Apr 24, 2026
Merged

chore: add ko_KR locale to nemotron personas datasets#572
johnnygreco merged 6 commits intomainfrom
johnny/chore/add-ko-kr-nemotron-personas-locale

Conversation

@johnnygreco
Copy link
Copy Markdown
Contributor

📋 Summary

  • Add ko_KR (South Korea) to the Nemotron-Personas locale registry alongside the existing 8 locales, registering its dataset size (2.66 GB) and NGC resource name so users can download it via the CLI
  • Update PII_FIELDS / PERSONA_FIELDS in dataset_based_person_fields.py to reflect the fields actually present in the installed parquet schemas — adds ko_KR-specific fields (health indicators, household/economic status, family_persona), en_SG-specific fields (industry, preferred_english_name), shared ethnic_background (en_US/en_SG), and drops fields no longer produced (commune, departement, prefecture, digital_skills); digital_skill replaces digital_skills
  • Bump fr_FR dataset size to 3.87 GB to match the current NGC artifact
  • Reconcile docs/concepts/person_sampling.md persona and PII field tables with the new field layout (adds Korea-Specific and Singapore-Specific sections, regroups religion/India fields)
  • Update CLI download/repository/service tests to expect 9 locales and assert on ko_KR and fr_FR

Register Korean (ko_KR, 2.66 GB) as an available managed persona
dataset locale, update related CLI/repository tests, and document the
new locale and its NGC download command.
Remove stale per-locale fields that no longer exist in any managed
parquet (commune, departement, prefecture), drop district from the
India-specific section since it's already listed in Core Fields,
rename digital_skills → digital_skill to match the actual ja_JP
column, and add sections for ko_KR, en_SG, and the en_US/en_SG
shared ethnic_background. Corrects the religion-family membership
to include en_SG.
The test asserts all 9 locales were downloaded but only enumerates 8
in its per-locale checks — fr_FR has been missing since before the
ko_KR addition. Align the enumeration with the count.
@johnnygreco johnnygreco requested a review from a team as a code owner April 24, 2026 16:54
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 24, 2026

Docs preview: https://08d71afb.dd-docs-preview.pages.dev

Notebook tutorials are placeholder-only in previews.

@github-actions
Copy link
Copy Markdown
Contributor

PR #572 Review — chore: add ko_KR locale to nemotron personas datasets

Summary

Adds the ko_KR (South Korea) locale to the Nemotron-Personas registry and reconciles the PII_FIELDS / PERSONA_FIELDS lists in dataset_based_person_fields.py with the fields actually present in the installed parquet schemas. Bumps the fr_FR dataset size from 2.71 GB3.87 GB, renames digital_skillsdigital_skill, introduces Korea- and Singapore-specific field groups, drops commune / departement / prefecture / digital_skills, and updates three CLI test modules plus the person_sampling.md docs. Pure data/config + doc change — no engine logic is touched.

Findings

Correctness

  • LOCALES_WITH_MANAGED_DATASETS propagates automatically — it's derived from NEMOTRON_PERSONAS_DATASET_SIZES.keys() in constants.py:395, so adding ko_KR to the size map is sufficient; no other constant needs editing. ✔
  • fr_FR size change is a ~43% jump (2.71 GB3.87 GB). The PR description says this matches the current NGC artifact. Worth double-checking the NGC resource before merge — a stale size only affects the CLI download progress UX, but it's the kind of drift the runner-memory baselines are supposed to catch.
  • Field re-categorization is a behavior claim, not just a rename. The diff moves zone from "Japan-specific" to "India locales", moves district from "India-specific" to universal, and removes prefecture entirely. PII_FIELDS / PERSONA_FIELDS drive downstream filtering, so any locale whose parquet is missing a field now declared "universal" (e.g., district) will produce nulls or raise at generation time. Since the PR description says these lists were derived empirically from the installed schemas, this should be fine — but consider adding a lightweight regression test that iterates over LOCALES_WITH_MANAGED_DATASETS and asserts every universal field is present in each locale's parquet, so the next schema drift is caught automatically instead of manually.
  • digital_skillsdigital_skill renamegrep shows no remaining references to digital_skills anywhere in the repo, so the rename is complete. ✔
  • No code references to removed fields (commune, departement, prefecture) remain outside the removed lines. ✔

Conventions / Style

  • New groupings in PII_FIELDS / PERSONA_FIELDS are clearly commented by locale and alphabetized within each group — a readability win over the prior layout.
  • The "Runtime-generated / attached fields" comment in PII_FIELDS is a useful distinction that wasn't previously documented; consider whether it belongs in the docstring of the module for broader discoverability, but not blocking.
  • AGENTS.md requires from __future__ import annotations and full type annotations; this PR doesn't touch typed code paths, so n/a.

Tests

  • Test updates are strictly count bumps (8 → 9) and membership assertions for ko_KR / fr_FR. Good, minimal, consistent with the existing style in these files.
  • The test_run_personas_with_all_flag test was previously missing an fr_FR assertion (it asserted the other 7 of 8 locales); this PR opportunistically fixes that gap. ✔
  • Gap: no test exercises the new ko_KR-specific fields (e.g., blood_pressure_status, family_persona). Given this is a data-only change and the existing test suite already validates the machinery generically, adding a ko_KR-specific case isn't strictly required, but a single smoke test that instantiates a Person sampler with locale="ko_KR" would catch schema-vs-field-list drift cheaply.

Docs

  • docs/concepts/person_sampling.md is consistent with the new field lists — section names match (Korea-Specific, Singapore-Specific, English Locales Shared, Religion Fields, India Locales Fields).
  • The new NGC download snippet for ko_KR follows the existing pattern. ✔
  • Renaming the India section header from "India-Specific Fields" → "India Locales Fields" and "Brazil and India Shared" → "Religion Fields" makes the grouping clearer now that en_SG joins the religion set.

Performance / Security

  • No performance implications — lists are small and evaluated at import.
  • No security implications — config and docs only.

Verdict

Approve with nits. This is a tight, well-scoped data/config update. The reorganization of PII_FIELDS / PERSONA_FIELDS is a clear improvement, tests and docs stay in sync with the code, and the indirection through NEMOTRON_PERSONAS_DATASET_SIZES means no other constants need touching. Main suggestion: verify the fr_FR size against the live NGC resource before merge, and consider a follow-up test that validates each managed locale's parquet schema against the declared field lists so future drift is caught automatically rather than manually.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 24, 2026

Greptile Summary

This PR adds ko_KR (South Korea) as the 9th locale in the Nemotron-Personas dataset registry, updates PII_FIELDS/PERSONA_FIELDS to match current parquet schemas (adding Korea- and Singapore-specific fields, dropping removed fields, renaming digital_skillsdigital_skill), bumps the fr_FR dataset size, and reconciles the docs with the new field layout.

Confidence Score: 5/5

Safe to merge; all P2 findings are documentation gaps with no runtime impact.

Only P2 findings present (district undocumented after reclassification). All code logic, test counts, and registry entries are internally consistent. No correctness or functional bugs were identified.

docs/concepts/person_sampling.md — district field reclassified as universal in code but left undocumented in the universal fields table.

Important Files Changed

Filename Overview
docs/concepts/person_sampling.md Adds ko_KR locale, Singapore-specific, and English-shared field sections; regroups religion/India fields; drops removed fields (commune, departement, prefecture, digital_skills). Minor gap: district removed from India section but not added to universal table.
packages/data-designer-config/src/data_designer/config/utils/constants.py Adds ko_KR entry (2.66 GB) to NEMOTRON_PERSONAS_DATASET_SIZES and bumps fr_FR from 2.71 GB to 3.87 GB; straightforward registry update.
packages/data-designer-engine/src/data_designer/engine/sampling_gen/entities/dataset_based_person_fields.py Reorganises PII_FIELDS and PERSONA_FIELDS: adds ko_KR health/household fields and family_persona, adds en_SG-specific fields and ethnic_background, drops commune/departement/prefecture/digital_skills, renames digital_skills → digital_skill, promotes district to universal block.
packages/data-designer/tests/cli/repositories/test_persona_repository.py Bumps locale counts to 9, expands locale set assertion to include ko_KR, and adds ko_KR size/dataset-name test case.
packages/data-designer/tests/cli/controllers/test_download_controller.py Bumps expected locale count from 8 to 9, adds ko_KR and fr_FR to downloaded-locales assertions.
packages/data-designer/tests/cli/services/test_download_service.py Bumps expected locale count from 8 to 9 and adds ko_KR assertion; clean mechanical update.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[NEMOTRON_PERSONAS_DATASET_SIZES\nconstants.py] -->|9 locales incl. ko_KR| B[PersonaRepository\n_registry]
    B --> C[DownloadService\nget_available_locales]
    B --> D[DownloadController\n_determine_locales]
    C --> E[CLI download command]
    D --> E
    E -->|all_locales=True| F[NGC download\nfor each locale]
    G[PII_FIELDS / PERSONA_FIELDS\ndataset_based_person_fields.py] -->|field allow-list| H[Person sampler\nfilter/select columns]
    H --> I[Generated dataset\nwith locale-specific fields]
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: docs/concepts/person_sampling.md
Line: 238-244

Comment:
**`district` removed from docs but promoted to universal in code**

In `dataset_based_person_fields.py`, `district` was moved from the India-specific block into the "Universal demographic fields (present in every managed locale)" comment section. However, the doc update removes `district` from the **India Locales Fields** section without adding it to the universal fields table (lines ~181–192). After this PR `district` is not documented under any locale section.

If `district` is now truly universal, it should appear in the universal PII fields table. If it isn't universal, the code comment classification needs to be corrected.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (2): Last reviewed commit: "docs: add ko_KR to locale parameter list" | Re-trigger Greptile

@johnnygreco
Copy link
Copy Markdown
Contributor Author

Nemotron personas schema audit

I created a temporary audit workspace at .tmp_persona_schema_audit/ and checked packages/data-designer-engine/src/data_designer/engine/sampling_gen/entities/dataset_based_person_fields.py against every managed Nemotron Personas locale.

Result: PASS

  • All 9 managed locales were checked: en_US, en_IN, en_SG, fr_FR, hi_Deva_IN, hi_Latn_IN, ja_JP, ko_KR, pt_BR.
  • For every locale: missing=0, extra=0, schema_missing=0.
  • No duplicate entries in PII_FIELDS or PERSONA_FIELDS.
  • No overlap between PII_FIELDS and PERSONA_FIELDS.
  • No non-runtime schema fields are unused across the dataset set.
Locale Source Rows Expected Actual Missing Extra Schema missing
en_US NGC latest download 1,000,000 39 39 0 0 0
en_IN NGC latest download 1,000,000 48 48 0 0 0
en_SG installed managed assets 148,000 44 44 0 0 0
fr_FR installed managed assets 1,000,000 43 43 0 0 0
hi_Deva_IN installed managed assets 1,000,000 48 48 0 0 0
hi_Latn_IN NGC latest download 1,000,000 48 48 0 0 0
ja_JP NGC latest download 1,000,000 41 41 0 0 0
ko_KR installed managed assets 1,000,000 51 51 0 0 0
pt_BR installed managed assets 1,000,000 42 42 0 0 0

Runtime-generated fields were excluded from raw parquet exactness because they are added by DataDesigner at generation time: birth_date, email_address, locale, national_id, phone_number, state.

One caveat: PII_FIELDS and PERSONA_FIELDS are global allowlists, not executable per-locale schemas, so a single locale will naturally not contain fields belonging only to other locales. The per-locale exactness check passes.

@andreatgretel
Copy link
Copy Markdown
Contributor

docs/concepts/person_sampling.md:278

| `locale` | str | Language/region code - must be one of: "en_US", "en_IN", "en_SG", "fr_FR", "hi_Deva_IN", "hi_Latn_IN", "ja_JP", "pt_BR" |

ko_KR is missing from the locale list in this table - every other section in the doc was updated but this one got missed

@johnnygreco
Copy link
Copy Markdown
Contributor Author

Thanks @andreatgretel, good catch. I added ko_KR to the locale parameter table in docs/concepts/person_sampling.md and pushed the update in d8328f3.

Copy link
Copy Markdown
Contributor

@andreatgretel andreatgretel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - clean locale addition with thorough schema reconciliation. Johnny's parquet audit confirms all 9 locales match the updated field lists. Ship it.

@johnnygreco johnnygreco merged commit a65903e into main Apr 24, 2026
50 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants