
Conversation

@farook-edev (Contributor) commented Nov 17, 2025

For language detection, I used CLD2, which, while I know it is old, seems to be a simple, reliable, zero-dependency library that I was easily able to build, test, and integrate.

For JSON validation, I used a modified part of Crow's JSON implementation. The reason for using it rather than something like nlohmann's implementation is that it is both simpler and faster (roughly 8x to 12x), which makes it much better suited to this use case than a bulkier, more end-user-friendly library. That is putting aside my extensive familiarity with Crow's JSON.
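For context, here is a minimal sketch of what a JSON-validity check looks like against the upstream Crow API; since this PR vendors a modified subset of Crow's JSON code, the actual integration may differ, and the include path below is illustrative:

```cpp
// Illustrative sketch only: checks whether stdin is valid JSON using
// Crow's parser. crow::json::load returns an rvalue that evaluates to
// false when parsing fails.
#include <iostream>
#include <iterator>
#include <string>

#include "crow/json.h"  // exact location depends on how Crow is vendored

// Returns true when the input parses as valid JSON.
bool is_valid_json(const std::string& text) {
    auto parsed = crow::json::load(text);
    return static_cast<bool>(parsed);  // false => parse error
}

int main() {
    std::string input((std::istreambuf_iterator<char>(std::cin)),
                      std::istreambuf_iterator<char>());
    std::cout << (is_valid_json(input) ? "valid" : "invalid") << "\n";
    return 0;
}
```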

@farook-edev requested review from a team and anhappdev as code owners November 17, 2025 21:25
github-actions bot commented Nov 17, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@farook-edev (Contributor, Author)

This PR should finalize the IFEval implementation, potentially closing #1060

@farook-edev linked an issue Nov 18, 2025 that may be closed by this pull request
@freedomtan (Contributor)

Please compare with the original IFEval implementation to make sure that we can get similar evaluation results.

@farook-edev (Contributor, Author)

> Please compare with the original IFEval implementation to make sure that we can get similar evaluation results.

I've created and run tests for both JSON validation and language detection. The short version is that this implementation exceeds the example implementation in both accuracy and performance, for both scenarios.


Language Detection

I used a small executable that simply runs the language detection code with stdin as input and stdout as output: you feed in text, and it prints a language code.

It's worth mentioning that this implementation uses CLD2, whereas the example uses Python's langdetect.
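For reference, a minimal sketch of what such a stdin-to-stdout tester can look like on top of CLD2's simple API; header paths depend on how CLD2 is vendored, so treat this as illustrative rather than the actual tester source:

```cpp
// Illustrative sketch only: reads all of stdin, runs CLD2 language
// detection, and prints the detected language code to stdout.
// Assumes CLD2's public headers are on the include path.
#include <iostream>
#include <iterator>
#include <string>

#include "compact_lang_det.h"  // CLD2 public API

int main() {
    // Read the whole input as one plain-text document.
    std::string text((std::istreambuf_iterator<char>(std::cin)),
                     std::istreambuf_iterator<char>());

    bool is_reliable = false;
    CLD2::Language lang = CLD2::DetectLanguage(
        text.data(), static_cast<int>(text.size()),
        /*is_plain_text=*/true, &is_reliable);

    // LanguageCode() maps the Language enum to a short code such as "en".
    std::cout << CLD2::LanguageCode(lang) << "\n";
    return 0;
}
```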

As a dataset I used the test set of papluca/language-identification; here are the raw results:

🭨🛉 farook 🭬🖵 kesag 🭬🗀 testinglang 🭬 ./testermain ./test.csv ./testerc
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10001/10001 [00:18<00:00, 548.32it/s]
Result: 9674/10001 - 96.73%
[ble: elapsed 18.332s (CPU 105.3%)] ./testermain ./test.csv ./testerc
🭨🛉 farook 🭬🖵 kesag 🭬🗀 testinglang 🭬 ./testermain ./test.csv ./testerp
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10001/10001 [57:37<00:00,  2.89it/s]
Result: 9311/10001 - 93.1%
[ble: elapsed 57m37s (CPU 99.2%)] ./testermain ./test.csv ./testerp

testerc above refers to this implementation (C++), whereas testerp refers to the original (Python) implementation.

As you can see, our implementation was about 3.6 percentage points more accurate (96.73% vs. 93.1%), which translates to 363 out of 10001 cases.
Additionally, our implementation was roughly 190 times faster than the original, taking about 18 seconds for the entire dataset as opposed to nearly an hour.

JSON Parsing

I used the same testing framework, but with a different script to fit the different input.

This implementation uses Crow's JSON parser, while the example simply uses Python's built-in json library.

The dataset used was taken from Nicolas Seriot's work, specifically the JSON parsing test suite.

The raw results are as follows:

🭨🛉 farook 🭬🖵 kesag 🭬🗀 testingjsonparse 🭬 ./testermain.sh ./testerc ./test_parsing
i_string_utf16BE_no_BOM.json 1 FAIL
i_string_utf16LE_no_BOM.json 1 FAIL
i_string_UTF-16LE_with_BOM.json 1 FAIL
i_structure_UTF-8_BOM_empty_object.json 1 FAIL
n_string_unescaped_ctrl_char.json 0 FAIL
n_string_unescaped_newline.json 0 FAIL
n_string_unescaped_tab.json 0 FAIL
./testermain.sh: line 16: 1160837 Segmentation fault         (core dumped) "$cmd" < "$file"
n_structure_100000_opening_arrays.json 139 CRASH
./testermain.sh: line 16: 1160894 Segmentation fault         (core dumped) "$cmd" < "$file"
n_structure_open_array_object.json 139 CRASH
n_structure_whitespace_formfeed.json 0 FAIL
Fail states: 8
Run time: 1310 ms
🭨🛉 farook 🭬🖵 kesag 🭬🗀 testingjsonparse 🭬 ./testermain.sh ./testerp ./test_parsing
i_string_invalid_utf-8.json 1 FAIL
i_string_iso_latin_1.json 1 FAIL
i_string_lone_utf8_continuation_byte.json 1 FAIL
i_string_not_in_unicode_range.json 1 FAIL
i_string_overlong_sequence_2_bytes.json 1 FAIL
i_string_overlong_sequence_6_bytes.json 1 FAIL
i_string_overlong_sequence_6_bytes_null.json 1 FAIL
i_string_truncated-utf-8.json 1 FAIL
i_string_utf16BE_no_BOM.json 1 FAIL
i_string_utf16LE_no_BOM.json 1 FAIL
i_string_UTF-16LE_with_BOM.json 1 FAIL
i_string_UTF-8_invalid_sequence.json 1 FAIL
i_string_UTF8_surrogate_U+D800.json 1 FAIL
i_structure_UTF-8_BOM_empty_object.json 1 FAIL
n_number_infinity.json 0 FAIL
n_number_minus_infinity.json 0 FAIL
n_number_NaN.json 0 FAIL
Fail states: 17
Run time: 9465 ms

testerc and testerp are the same as before.

There are a few notable details. First, the output above only shows failures; the full dataset contained 318 test cases.

Second, any test starting with i_ or y_ is expected to parse successfully, whereas ones starting with n_ are expected to be rejected (crashing is considered a failure; see the sketch after these notes).

Third, the only fail state in this implementation that isn't also present in the Python implementation is one where the parser was supposed to fail at parsing [], which I'm not sure is even invalid JSON.

Finally, the performance results here are similar: this implementation took 1.3 seconds as opposed to 9.5 seconds, which is approximately 7 times faster.
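For clarity, the pass/fail rule used to score the runs above can be summarized as a small hypothetical helper; the real harness is a shell script (testermain.sh), so this only restates its logic:

```cpp
// Hypothetical restatement of the scoring rule; not the actual harness.
#include <string>

enum class Outcome { Pass, Fail };

// filename:  test case name, e.g. "n_structure_open_array_object.json"
// exit_code: 0 means the parser accepted the input, a non-zero value means
//            it rejected it, and values above 128 mean it was killed by a
//            signal (e.g. 139 for SIGSEGV), i.e. it crashed.
Outcome classify(const std::string& filename, int exit_code) {
    if (exit_code > 128) return Outcome::Fail;  // crashing is always a failure

    const bool accepted = (exit_code == 0);
    // i_* and y_* inputs are expected to parse, n_* inputs are expected to
    // be rejected, per the convention described above.
    const bool expect_accept =
        filename.rfind("y_", 0) == 0 || filename.rfind("i_", 0) == 0;
    return (accepted == expect_accept) ? Outcome::Pass : Outcome::Fail;
}
```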


Please let me know if I missed anything, or if you need more info or the source code for the tests.

@freedomtan (Contributor)

Please use Google's input_response_data_gpt4_20231107_145030.jsonl.
