Add JSON and Language validators for IFEval #1078
base: submission-v6.0
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
This PR should finalize the IFEval implementation, potentially closing #1060.
Please compare with the original IFEval implementation to make sure that we can get similar evaluation results.
I've created and run tests for both JSON and language, and the short version is that this implementation exceeds the example implementation in both accuracy and performance, for both scenarios.

Language Detection

I used a small executable that simply runs the language detection code with stdin as input and stdout as output: you put in text, and it prints a language code (a minimal illustrative sketch of such a harness follows the results below). It's worth mentioning that this implementation uses CLD2, whereas the example uses Python's langdetect. As a dataset I used papluca/language-identification's test set. The raw results are as follows:

$ ./testermain ./test.csv ./testerc
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10001/10001 [00:18<00:00, 548.32it/s]
Result: 9674/10001 - 96.73%
[ble: elapsed 18.332s (CPU 105.3%)] ./testermain ./test.csv ./testerc
$ ./testermain ./test.csv ./testerp
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10001/10001 [57:37<00:00, 2.89it/s]
Result: 9311/10001 - 93.1%
[ble: elapsed 57m37s (CPU 99.2%)] ./testermain ./test.csv ./testerp
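As an aside, a minimal sketch of a stdin-to-stdout CLD2 harness like the one described above could look something like this. It's purely illustrative rather than the actual tester, and the include paths depend on how CLD2 is vendored:

```cpp
// Illustrative sketch only, not the tester used for the results above.
// Reads text from stdin and prints the CLD2 language code on stdout.
#include <iostream>
#include <sstream>
#include <string>

#include "compact_lang_det.h"    // CLD2 public API; exact path depends on vendoring
#include "generated_language.h"  // declares CLD2::LanguageCode (path may differ)

int main() {
    // Slurp the whole input text from stdin.
    std::ostringstream buffer;
    buffer << std::cin.rdbuf();
    const std::string text = buffer.str();

    bool is_reliable = false;
    const CLD2::Language lang = CLD2::DetectLanguage(
        text.data(), static_cast<int>(text.size()),
        /*is_plain_text=*/true, &is_reliable);

    // Print the language code (e.g. "en", "de") so the test loop can compare it
    // against the expected label from the CSV.
    std::cout << CLD2::LanguageCode(lang) << "\n";
    return 0;
}
```

The CSV-driven loop then presumably just pipes each row's text into this binary and checks the printed code against the row's label.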
As you can see, our implementation was ~3.6 percentage points more accurate, which translates to 363/10001 cases.

JSON Parsing

I used the same testing framework, but with a different script to fit the different input. This implementation uses Crow's JSON parser, while the example simply uses Python's built-in json library. The dataset was taken from Nicolas Seriot's work, specifically the JSON parsing test suite. The raw results are as follows:

$ ./testermain.sh ./testerc ./test_parsing
i_string_utf16BE_no_BOM.json 1 FAIL
i_string_utf16LE_no_BOM.json 1 FAIL
i_string_UTF-16LE_with_BOM.json 1 FAIL
i_structure_UTF-8_BOM_empty_object.json 1 FAIL
n_string_unescaped_ctrl_char.json 0 FAIL
n_string_unescaped_newline.json 0 FAIL
n_string_unescaped_tab.json 0 FAIL
./testermain.sh: line 16: 1160837 Segmentation fault (core dumped) "$cmd" < "$file"
n_structure_100000_opening_arrays.json 139 CRASH
./testermain.sh: line 16: 1160894 Segmentation fault (core dumped) "$cmd" < "$file"
n_structure_open_array_object.json 139 CRASH
n_structure_whitespace_formfeed.json 0 FAIL
Fail states: 8
Run time: 1310 ms
$ ./testermain.sh ./testerp ./test_parsing
i_string_invalid_utf-8.json 1 FAIL
i_string_iso_latin_1.json 1 FAIL
i_string_lone_utf8_continuation_byte.json 1 FAIL
i_string_not_in_unicode_range.json 1 FAIL
i_string_overlong_sequence_2_bytes.json 1 FAIL
i_string_overlong_sequence_6_bytes.json 1 FAIL
i_string_overlong_sequence_6_bytes_null.json 1 FAIL
i_string_truncated-utf-8.json 1 FAIL
i_string_utf16BE_no_BOM.json 1 FAIL
i_string_utf16LE_no_BOM.json 1 FAIL
i_string_UTF-16LE_with_BOM.json 1 FAIL
i_string_UTF-8_invalid_sequence.json 1 FAIL
i_string_UTF8_surrogate_U+D800.json 1 FAIL
i_structure_UTF-8_BOM_empty_object.json 1 FAIL
n_number_infinity.json 0 FAIL
n_number_minus_infinity.json 0 FAIL
n_number_NaN.json 0 FAIL
Fail states: 17
Run time: 9465 ms
There are a few notable details. First, the output only shows failures; the actual dataset contained 318 values to be tested. Second, any test whose name starts with i_ falls into the suite's implementation-defined category, so either accepting or rejecting it is acceptable. Third, the only fail state in this implementation that isn't also in the Python implementation is one where the parser was supposed to fail at parsing the input. Finally, the performance results are similarly in our favor, with this implementation taking 1.3 seconds as opposed to 9.5 seconds, which is approximately 7 times faster.

Please let me know if I missed anything, or if you need more info or the source code for the tests.
Please use Google's input_response_data_gpt4_20231107_145030.jsonl.



For language detection, I used CLD2. While I know it's old, it's a simple, reliable, zero-dependency library that I was easily able to build, test, and integrate.
For JSON validation, I used a modified part of Crow's JSON implementation. The reason for using it as opposed to something like nlohmann's implementation is that it's both simpler and faster (somewhere between 8x and 12x), which makes it much more suitable for this use case than a bulkier, more end-user-friendly library. That's putting aside my extensive familiarity with Crow's JSON.
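For illustration, a minimal validity check in the spirit of the testerc binary from the JSON results above might look like the sketch below. It's not the actual code from this PR (which uses a modified, vendored copy of Crow's parser), and the exit-code convention simply mirrors what the shell harness above appears to expect (0 = accepted, non-zero = rejected):

```cpp
// Illustrative sketch only: a stdin -> exit-code JSON validity checker built on
// upstream crow::json. The PR vendors a modified copy, so the real include path
// and API may differ.
#include <iostream>
#include <sstream>
#include <string>

#include "crow/json.h"  // upstream Crow header; the vendored copy may live elsewhere

int main() {
    // Read the entire candidate document from stdin.
    std::ostringstream buffer;
    buffer << std::cin.rdbuf();
    const std::string input = buffer.str();

    // crow::json::load returns an invalid rvalue when parsing fails, so a simple
    // truthiness check is enough for an accept/reject decision.
    auto parsed = crow::json::load(input);
    if (!parsed)
        return 1;  // rejected: expected outcome for n_* cases
    return 0;      // accepted: expected outcome for y_* cases
}
```

Functionally, a heavier library could perform the same accept/reject check; the point above is just that a lighter parser is sufficient, and faster, for it.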