
Conversation

@farook-edev (Contributor) commented Nov 17, 2025

For language detection, I used CLD2, which, while I know it is old, seems to be a simple, reliable, zero-dependency library that I was easily able to build, test, and integrate.

For JSON validation, I used a modified part of Crow's JSON implementation. The reason for using it rather than something like nlohmann's implementation is that it is both simpler and faster (roughly 8x to 12x), which makes it much better suited to this use case than a bulkier, more end-user-friendly library. That is putting aside my extensive familiarity with Crow's JSON.
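For context, here is a minimal sketch of what a JSON-validity check looks like against the upstream Crow API; since this PR vendors a modified subset of Crow's JSON code, the actual integration may differ, and the include path below is illustrative:

```cpp
// Illustrative sketch only: checks whether stdin is valid JSON using
// Crow's parser. crow::json::load returns an rvalue that evaluates to
// false when parsing fails.
#include <iostream>
#include <iterator>
#include <string>

#include "crow/json.h"  // exact location depends on how Crow is vendored

// Returns true when the input parses as valid JSON.
bool is_valid_json(const std::string& text) {
    auto parsed = crow::json::load(text);
    return static_cast<bool>(parsed);  // false => parse error
}

int main() {
    std::string input((std::istreambuf_iterator<char>(std::cin)),
                      std::istreambuf_iterator<char>());
    std::cout << (is_valid_json(input) ? "valid" : "invalid") << "\n";
    return 0;
}
```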

@farook-edev requested review from a team and anhappdev as code owners November 17, 2025 21:25
github-actions bot commented Nov 17, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@farook-edev (Contributor, Author)

This PR should finalize the IFEval implementation, potentially closing #1060

@farook-edev linked an issue Nov 18, 2025 that may be closed by this pull request
@freedomtan (Contributor)

Please compare with the original IFEval implementation to make sure that we can get similar evaluation results.

@farook-edev (Contributor, Author)

> Please compare with the original IFEval implementation to make sure that we can get similar evaluation results.

I've created and run tests for both JSON validation and language detection. The short version is that this implementation exceeds the example implementation in both accuracy and performance, for both scenarios.


Language Detection

I used a small executable that simply runs the language detection code with stdin as input and stdout as output: you feed in text, and it prints a language code.

It's worth mentioning that this implementation uses CLD2, whereas the example uses Python's langdetect.
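For reference, a minimal sketch of what such a stdin-to-stdout tester can look like on top of CLD2's simple API; header paths depend on how CLD2 is vendored, so treat this as illustrative rather than the actual tester source:

```cpp
// Illustrative sketch only: reads all of stdin, runs CLD2 language
// detection, and prints the detected language code to stdout.
// Assumes CLD2's public headers are on the include path.
#include <iostream>
#include <iterator>
#include <string>

#include "compact_lang_det.h"  // CLD2 public API

int main() {
    // Read the whole input as one plain-text document.
    std::string text((std::istreambuf_iterator<char>(std::cin)),
                     std::istreambuf_iterator<char>());

    bool is_reliable = false;
    CLD2::Language lang = CLD2::DetectLanguage(
        text.data(), static_cast<int>(text.size()),
        /*is_plain_text=*/true, &is_reliable);

    // LanguageCode() maps the Language enum to a short code such as "en".
    std::cout << CLD2::LanguageCode(lang) << "\n";
    return 0;
}
```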

As a dataset I used the test set of papluca/language-identification; here are the raw results:

🭨🛉 farook 🭬🖵 kesag 🭬🗀 testinglang 🭬 ./testermain ./test.csv ./testerc
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10001/10001 [00:18<00:00, 548.32it/s]
Result: 9674/10001 - 96.73%
[ble: elapsed 18.332s (CPU 105.3%)] ./testermain ./test.csv ./testerc
🭨🛉 farook 🭬🖵 kesag 🭬🗀 testinglang 🭬 ./testermain ./test.csv ./testerp
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10001/10001 [57:37<00:00,  2.89it/s]
Result: 9311/10001 - 93.1%
[ble: elapsed 57m37s (CPU 99.2%)] ./testermain ./test.csv ./testerp

testerc above refers to this implementation (C++), whereas testerp refers to the original (Python) implementation.

As you can see, our implementation was about 3.6 percentage points more accurate (96.73% vs. 93.1%), which translates to 363 out of 10001 cases.
Additionally, our implementation was roughly 190 times faster than the original, taking about 18 seconds for the entire dataset as opposed to nearly an hour.

JSON Parsing

I used the same testing framework, but with a different script to fit the different input.

This implementation uses Crow's JSON parser, while the example simply uses Python's built-in json library.

The dataset used was taken from Nicolas Seriot's work, specifically the JSON parsing test suite.

The raw results are as follows:

🭨🛉 farook 🭬🖵 kesag 🭬🗀 testingjsonparse 🭬 ./testermain.sh ./testerc ./test_parsing
i_string_utf16BE_no_BOM.json 1 FAIL
i_string_utf16LE_no_BOM.json 1 FAIL
i_string_UTF-16LE_with_BOM.json 1 FAIL
i_structure_UTF-8_BOM_empty_object.json 1 FAIL
n_string_unescaped_ctrl_char.json 0 FAIL
n_string_unescaped_newline.json 0 FAIL
n_string_unescaped_tab.json 0 FAIL
./testermain.sh: line 16: 1160837 Segmentation fault         (core dumped) "$cmd" < "$file"
n_structure_100000_opening_arrays.json 139 CRASH
./testermain.sh: line 16: 1160894 Segmentation fault         (core dumped) "$cmd" < "$file"
n_structure_open_array_object.json 139 CRASH
n_structure_whitespace_formfeed.json 0 FAIL
Fail states: 8
Run time: 1310 ms
🭨🛉 farook 🭬🖵 kesag 🭬🗀 testingjsonparse 🭬 ./testermain.sh ./testerp ./test_parsing
i_string_invalid_utf-8.json 1 FAIL
i_string_iso_latin_1.json 1 FAIL
i_string_lone_utf8_continuation_byte.json 1 FAIL
i_string_not_in_unicode_range.json 1 FAIL
i_string_overlong_sequence_2_bytes.json 1 FAIL
i_string_overlong_sequence_6_bytes.json 1 FAIL
i_string_overlong_sequence_6_bytes_null.json 1 FAIL
i_string_truncated-utf-8.json 1 FAIL
i_string_utf16BE_no_BOM.json 1 FAIL
i_string_utf16LE_no_BOM.json 1 FAIL
i_string_UTF-16LE_with_BOM.json 1 FAIL
i_string_UTF-8_invalid_sequence.json 1 FAIL
i_string_UTF8_surrogate_U+D800.json 1 FAIL
i_structure_UTF-8_BOM_empty_object.json 1 FAIL
n_number_infinity.json 0 FAIL
n_number_minus_infinity.json 0 FAIL
n_number_NaN.json 0 FAIL
Fail states: 17
Run time: 9465 ms

testerc and testerp are the same as before.

There are a few notable details. First, the output above only shows failures; the full dataset contained 318 test cases.

Second, any test starting with i_ or y_ is expected to parse successfully, whereas ones starting with n_ are expected to be rejected (crashing is considered a failure; see the sketch after these notes).

Third, the only fail state in this implementation that isn't also present in the Python implementation is one where the parser was supposed to fail at parsing [], which I'm not sure is even invalid JSON.

Finally, the performance results here are similar: this implementation took 1.3 seconds as opposed to 9.5 seconds, which is approximately 7 times faster.
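For clarity, the pass/fail rule used to score the runs above can be summarized as a small hypothetical helper; the real harness is a shell script (testermain.sh), so this only restates its logic:

```cpp
// Hypothetical restatement of the scoring rule; not the actual harness.
#include <string>

enum class Outcome { Pass, Fail };

// filename:  test case name, e.g. "n_structure_open_array_object.json"
// exit_code: 0 means the parser accepted the input, a non-zero value means
//            it rejected it, and values above 128 mean it was killed by a
//            signal (e.g. 139 for SIGSEGV), i.e. it crashed.
Outcome classify(const std::string& filename, int exit_code) {
    if (exit_code > 128) return Outcome::Fail;  // crashing is always a failure

    const bool accepted = (exit_code == 0);
    // i_* and y_* inputs are expected to parse, n_* inputs are expected to
    // be rejected, per the convention described above.
    const bool expect_accept =
        filename.rfind("y_", 0) == 0 || filename.rfind("i_", 0) == 0;
    return (accepted == expect_accept) ? Outcome::Pass : Outcome::Fail;
}
```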


Please let me know if I missed anything, or if you need more info or the source code for the tests.

@freedomtan (Contributor)

Please use Google's input_response_data_gpt4_20231107_145030.jsonl.
