chore: add ko_KR locale to nemotron personas datasets by johnnygreco · Pull Request #572 · NVIDIA-NeMo/DataDesigner

johnnygreco · 2026-04-24T16:54:20Z

📋 Summary

Add ko_KR (South Korea) to the Nemotron-Personas locale registry alongside the existing 8 locales, registering its dataset size (2.66 GB) and NGC resource name so users can download it via the CLI
Update PII_FIELDS / PERSONA_FIELDS in dataset_based_person_fields.py to reflect the fields actually present in the installed parquet schemas — adds ko_KR-specific fields (health indicators, household/economic status, family_persona), en_SG-specific fields (industry, preferred_english_name), shared ethnic_background (en_US/en_SG), and drops fields no longer produced (commune, departement, prefecture, digital_skills); digital_skill replaces digital_skills
Bump fr_FR dataset size to 3.87 GB to match the current NGC artifact
Reconcile docs/concepts/person_sampling.md persona and PII field tables with the new field layout (adds Korea-Specific and Singapore-Specific sections, regroups religion/India fields)
Update CLI download/repository/service tests to expect 9 locales and assert on ko_KR and fr_FR

Register Korean (ko_KR, 2.66 GB) as an available managed persona dataset locale, update related CLI/repository tests, and document the new locale and its NGC download command.

Remove stale per-locale fields that no longer exist in any managed parquet (commune, departement, prefecture), drop district from the India-specific section since it's already listed in Core Fields, rename digital_skills → digital_skill to match the actual ja_JP column, and add sections for ko_KR, en_SG, and the en_US/en_SG shared ethnic_background. Corrects the religion-family membership to include en_SG.

The test asserts all 9 locales were downloaded but only enumerates 8 in its per-locale checks — fr_FR has been missing since before the ko_KR addition. Align the enumeration with the count.

github-actions · 2026-04-24T16:55:21Z

Docs preview: https://08d71afb.dd-docs-preview.pages.dev

Notebook tutorials are placeholder-only in previews.

github-actions · 2026-04-24T16:56:48Z

PR #572 Review — `chore: add ko_KR locale to nemotron personas datasets`

Summary

Adds the ko_KR (South Korea) locale to the Nemotron-Personas registry and reconciles the PII_FIELDS / PERSONA_FIELDS lists in dataset_based_person_fields.py with the fields actually present in the installed parquet schemas. Bumps the fr_FR dataset size from 2.71 GB → 3.87 GB, renames digital_skills → digital_skill, introduces Korea- and Singapore-specific field groups, drops commune / departement / prefecture / digital_skills, and updates three CLI test modules plus the person_sampling.md docs. Pure data/config + doc change — no engine logic is touched.

Findings

Correctness

LOCALES_WITH_MANAGED_DATASETS propagates automatically — it's derived from NEMOTRON_PERSONAS_DATASET_SIZES.keys() in constants.py:395, so adding ko_KR to the size map is sufficient; no other constant needs editing. ✔
fr_FR size change is a ~43% jump (2.71 GB → 3.87 GB). The PR description says this matches the current NGC artifact. Worth double-checking the NGC resource before merge — a stale size only affects the CLI download progress UX, but it's the kind of drift the runner-memory baselines are supposed to catch.
Field re-categorization is a behavior claim, not just a rename. The diff moves zone from "Japan-specific" to "India locales", moves district from "India-specific" to universal, and removes prefecture entirely. PII_FIELDS / PERSONA_FIELDS drive downstream filtering, so any locale whose parquet is missing a field now declared "universal" (e.g., district) will produce nulls or raise at generation time. Since the PR description says these lists were derived empirically from the installed schemas, this should be fine — but consider adding a lightweight regression test that iterates over LOCALES_WITH_MANAGED_DATASETS and asserts every universal field is present in each locale's parquet, so the next schema drift is caught automatically instead of manually.
digital_skills → digital_skill rename — grep shows no remaining references to digital_skills anywhere in the repo, so the rename is complete. ✔
No code references to removed fields (commune, departement, prefecture) remain outside the removed lines. ✔

Conventions / Style

New groupings in PII_FIELDS / PERSONA_FIELDS are clearly commented by locale and alphabetized within each group — a readability win over the prior layout.
The "Runtime-generated / attached fields" comment in PII_FIELDS is a useful distinction that wasn't previously documented; consider whether it belongs in the docstring of the module for broader discoverability, but not blocking.
AGENTS.md requires from __future__ import annotations and full type annotations; this PR doesn't touch typed code paths, so n/a.

Tests

Test updates are strictly count bumps (8 → 9) and membership assertions for ko_KR / fr_FR. Good, minimal, consistent with the existing style in these files.
The test_run_personas_with_all_flag test was previously missing an fr_FR assertion (it asserted the other 7 of 8 locales); this PR opportunistically fixes that gap. ✔
Gap: no test exercises the new ko_KR-specific fields (e.g., blood_pressure_status, family_persona). Given this is a data-only change and the existing test suite already validates the machinery generically, adding a ko_KR-specific case isn't strictly required, but a single smoke test that instantiates a Person sampler with locale="ko_KR" would catch schema-vs-field-list drift cheaply.

Docs

docs/concepts/person_sampling.md is consistent with the new field lists — section names match (Korea-Specific, Singapore-Specific, English Locales Shared, Religion Fields, India Locales Fields).
The new NGC download snippet for ko_KR follows the existing pattern. ✔
Renaming the India section header from "India-Specific Fields" → "India Locales Fields" and "Brazil and India Shared" → "Religion Fields" makes the grouping clearer now that en_SG joins the religion set.

Performance / Security

No performance implications — lists are small and evaluated at import.
No security implications — config and docs only.

Verdict

Approve with nits. This is a tight, well-scoped data/config update. The reorganization of PII_FIELDS / PERSONA_FIELDS is a clear improvement, tests and docs stay in sync with the code, and the indirection through NEMOTRON_PERSONAS_DATASET_SIZES means no other constants need touching. Main suggestion: verify the fr_FR size against the live NGC resource before merge, and consider a follow-up test that validates each managed locale's parquet schema against the declared field lists so future drift is caught automatically rather than manually.

greptile-apps · 2026-04-24T16:56:56Z

Greptile Summary

This PR adds ko_KR (South Korea) as the 9th locale in the Nemotron-Personas dataset registry, updates PII_FIELDS/PERSONA_FIELDS to match current parquet schemas (adding Korea- and Singapore-specific fields, dropping removed fields, renaming digital_skills → digital_skill), bumps the fr_FR dataset size, and reconciles the docs with the new field layout.

Confidence Score: 5/5

Safe to merge; all P2 findings are documentation gaps with no runtime impact.

Only P2 findings present (district undocumented after reclassification). All code logic, test counts, and registry entries are internally consistent. No correctness or functional bugs were identified.

docs/concepts/person_sampling.md — district field reclassified as universal in code but left undocumented in the universal fields table.

Important Files Changed

Filename	Overview
docs/concepts/person_sampling.md	Adds ko_KR locale, Singapore-specific, and English-shared field sections; regroups religion/India fields; drops removed fields (commune, departement, prefecture, digital_skills). Minor gap: `district` removed from India section but not added to universal table.
packages/data-designer-config/src/data_designer/config/utils/constants.py	Adds ko_KR entry (2.66 GB) to NEMOTRON_PERSONAS_DATASET_SIZES and bumps fr_FR from 2.71 GB to 3.87 GB; straightforward registry update.
packages/data-designer-engine/src/data_designer/engine/sampling_gen/entities/dataset_based_person_fields.py	Reorganises PII_FIELDS and PERSONA_FIELDS: adds ko_KR health/household fields and family_persona, adds en_SG-specific fields and ethnic_background, drops commune/departement/prefecture/digital_skills, renames digital_skills → digital_skill, promotes district to universal block.
packages/data-designer/tests/cli/repositories/test_persona_repository.py	Bumps locale counts to 9, expands locale set assertion to include ko_KR, and adds ko_KR size/dataset-name test case.
packages/data-designer/tests/cli/controllers/test_download_controller.py	Bumps expected locale count from 8 to 9, adds ko_KR and fr_FR to downloaded-locales assertions.
packages/data-designer/tests/cli/services/test_download_service.py	Bumps expected locale count from 8 to 9 and adds ko_KR assertion; clean mechanical update.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[NEMOTRON_PERSONAS_DATASET_SIZES\nconstants.py] -->|9 locales incl. ko_KR| B[PersonaRepository\n_registry]
    B --> C[DownloadService\nget_available_locales]
    B --> D[DownloadController\n_determine_locales]
    C --> E[CLI download command]
    D --> E
    E -->|all_locales=True| F[NGC download\nfor each locale]
    G[PII_FIELDS / PERSONA_FIELDS\ndataset_based_person_fields.py] -->|field allow-list| H[Person sampler\nfilter/select columns]
    H --> I[Generated dataset\nwith locale-specific fields]

Prompt To Fix All With AI

This is a comment left during a code review.
Path: docs/concepts/person_sampling.md
Line: 238-244

Comment:
**`district` removed from docs but promoted to universal in code**

In `dataset_based_person_fields.py`, `district` was moved from the India-specific block into the "Universal demographic fields (present in every managed locale)" comment section. However, the doc update removes `district` from the **India Locales Fields** section without adding it to the universal fields table (lines ~181–192). After this PR `district` is not documented under any locale section.

If `district` is now truly universal, it should appear in the universal PII fields table. If it isn't universal, the code comment classification needs to be corrected.

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (2): Last reviewed commit: "docs: add ko_KR to locale parameter list" | Re-trigger Greptile}

johnnygreco · 2026-04-24T20:43:29Z

Nemotron personas schema audit

I created a temporary audit workspace at .tmp_persona_schema_audit/ and checked packages/data-designer-engine/src/data_designer/engine/sampling_gen/entities/dataset_based_person_fields.py against every managed Nemotron Personas locale.

Result: PASS

All 9 managed locales were checked: en_US, en_IN, en_SG, fr_FR, hi_Deva_IN, hi_Latn_IN, ja_JP, ko_KR, pt_BR.
For every locale: missing=0, extra=0, schema_missing=0.
No duplicate entries in PII_FIELDS or PERSONA_FIELDS.
No overlap between PII_FIELDS and PERSONA_FIELDS.
No non-runtime schema fields are unused across the dataset set.

Locale	Source	Rows	Expected	Actual
`en_US`	NGC latest download	1,000,000	39	39
`en_IN`	NGC latest download	1,000,000	48	48
`en_SG`	installed managed assets	148,000	44	44
`fr_FR`	installed managed assets	1,000,000	43	43
`hi_Deva_IN`	installed managed assets	1,000,000	48	48
`hi_Latn_IN`	NGC latest download	1,000,000	48	48
`ja_JP`	NGC latest download	1,000,000	41	41
`ko_KR`	installed managed assets	1,000,000	51	51
`pt_BR`	installed managed assets	1,000,000	42	42

Runtime-generated fields were excluded from raw parquet exactness because they are added by DataDesigner at generation time: birth_date, email_address, locale, national_id, phone_number, state.

One caveat: PII_FIELDS and PERSONA_FIELDS are global allowlists, not executable per-locale schemas, so a single locale will naturally not contain fields belonging only to other locales. The per-locale exactness check passes.

andreatgretel · 2026-04-24T21:01:11Z

docs/concepts/person_sampling.md:278

| `locale` | str | Language/region code - must be one of: "en_US", "en_IN", "en_SG", "fr_FR", "hi_Deva_IN", "hi_Latn_IN", "ja_JP", "pt_BR" |

ko_KR is missing from the locale list in this table - every other section in the doc was updated but this one got missed

johnnygreco · 2026-04-24T21:06:00Z

Thanks @andreatgretel, good catch. I added ko_KR to the locale parameter table in docs/concepts/person_sampling.md and pushed the update in d8328f3.

andreatgretel

Looks good - clean locale addition with thorough schema reconciliation. Johnny's parquet audit confirms all 9 locales match the updated field lists. Ship it.

johnnygreco added 5 commits April 24, 2026 10:42

chore: add ko_KR locale to nemotron personas datasets

f108805

Register Korean (ko_KR, 2.66 GB) as an available managed persona dataset locale, update related CLI/repository tests, and document the new locale and its NGC download command.

update person fields

678407c

update fr_FR size

e530ce2

test: add missing fr_FR assertion in test_run_personas_with_all_flag

ab81b21

The test asserts all 9 locales were downloaded but only enumerates 8 in its per-locale checks — fr_FR has been missing since before the ko_KR addition. Align the enumeration with the count.

johnnygreco requested a review from a team as a code owner April 24, 2026 16:54

johnnygreco temporarily deployed to agentic-ci April 24, 2026 16:54 — with GitHub Actions Inactive

docs: add ko_KR to locale parameter list

d8328f3

andreatgretel approved these changes Apr 24, 2026

View reviewed changes

johnnygreco merged commit a65903e into main Apr 24, 2026
50 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore: add ko_KR locale to nemotron personas datasets#572

chore: add ko_KR locale to nemotron personas datasets#572
johnnygreco merged 6 commits intomainfrom
johnny/chore/add-ko-kr-nemotron-personas-locale

johnnygreco commented Apr 24, 2026

Uh oh!

github-actions Bot commented Apr 24, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Apr 24, 2026

Uh oh!

greptile-apps Bot commented Apr 24, 2026 •

edited

Loading

Confidence Score: 5/5

Flowchart

Uh oh!

johnnygreco commented Apr 24, 2026

Uh oh!

andreatgretel commented Apr 24, 2026

Uh oh!

johnnygreco commented Apr 24, 2026

Uh oh!

andreatgretel left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

johnnygreco commented Apr 24, 2026

📋 Summary

Uh oh!

github-actions Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Apr 24, 2026

PR #572 Review — chore: add ko_KR locale to nemotron personas datasets

Summary

Findings

Correctness

Conventions / Style

Tests

Docs

Performance / Security

Verdict

Uh oh!

greptile-apps Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

johnnygreco commented Apr 24, 2026

Nemotron personas schema audit

Uh oh!

andreatgretel commented Apr 24, 2026

Uh oh!

johnnygreco commented Apr 24, 2026

Uh oh!

andreatgretel left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented Apr 24, 2026 •

edited

Loading

PR #572 Review — `chore: add ko_KR locale to nemotron personas datasets`

greptile-apps Bot commented Apr 24, 2026 •

edited

Loading