Batch Workbench Uploads #7578
base: main
Conversation
Did some profiling to determine which parts of the upload/validate pipeline are taking the most time for each row. Here are the timing results from the first 1,000 rows of the cash upload dataset:
After adding caching for apply_scoping, the new timing results were:
So, that gets us about a 2x improvement, from roughly 500 rows per minute to 1,000 rows per minute. Trying to get a 5x to 10x improvement if possible. Working now on speeding up the sections in the process_row function. It isn't lending itself well to batching, so I'm exploring multiple solutions and adding more fine-grained profiling.
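For reference, a minimal sketch of the kind of caching described above, assuming apply_scoping is deterministic for a given collection and uploadable type; the cache key and function names here are illustrative, not the identifiers actually used in this branch:

```python
# Hypothetical memoization of apply_scoping results; the real change in this
# branch may cache at a different granularity.
from typing import Any, Dict, Tuple

_scoping_cache: Dict[Tuple[int, str], Any] = {}

def apply_scoping_cached(uploadable, collection):
    """Return a scoped uploadable, reusing the result for repeated rows."""
    key = (collection.id, type(uploadable).__name__)
    if key not in _scoping_cache:
        _scoping_cache[key] = uploadable.apply_scoping(collection)
    return _scoping_cache[key]
```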
Added code that can use bulk_insert on applicable rows. The full validation of the cash Workbench dataset of 321,216 records took 50 minutes to complete. So, in terms of rows per minute, we've gone from 500 to 1,000, and now to about 6,000, which is roughly a 10x speed increase on the cash example. Still need to look into which types of rows can be used with bulk_insert, and which should not be, to avoid possible issues. Also looking into implementing bulk_update for other situations. There are also some possible speedups that might work for the binding and matching sections of the code.
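To illustrate the idea (not this PR's actual implementation), here is a hedged sketch of bulk insertion with Django, where model, rows_to_insert, and BATCH_SIZE are placeholder names:

```python
from itertools import islice

BATCH_SIZE = 1000  # batch size is a tuning parameter, not a value from the PR

def bulk_insert(model, rows_to_insert):
    """Insert pre-validated model instances in batches with bulk_create."""
    it = iter(rows_to_insert)
    while True:
        batch = list(islice(it, BATCH_SIZE))
        if not batch:
            break
        # bulk_create bypasses per-instance save() overrides and save signals,
        # which is why only "applicable" rows are safe to insert this way.
        model.objects.bulk_create(batch, batch_size=BATCH_SIZE)
```

Since bulk_create skips per-row save logic, rows whose correctness depends on save-time business rules are the ones that likely cannot use this path.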
This reverts commit cfca16c.
Anyone who has been doing conversions or working with a large workbench validation that takes a while to run: go ahead and try running the validation on this branch, and see if the results are the same and if there is a speedup. If you can, put the workbench data on Google Drive so any issues can be recreated and debugged. Thanks. We don't plan on merging this branch for 7.12; it doesn't seem to work with batch edit, so we'll likely just use this branch internally when needed.
combs-a left a comment
Initial Testing:
- Run the validation process on workbench data that you know takes a decent amount of time to validate. Run the validation on this branch and on the main branch for comparison.
- See if there was a speedup in validation time compared to the main branch.
- See if the validation results look the same as the validation results in the main branch.
I haven't gotten a proper comparison or anything just yet, but did experience this error when uploading localities:
Maybe it's no longer running the business rule that sets SrcLatLongUnit to 0 when it isn't present; the issue is not present on main. Grant linked the locality business rules, so you can check the exact business rule.
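For context, the behavior being tested is roughly the following (a hypothetical illustration using a plain Django pre_save receiver, not Specify's actual business-rule code; the import path and field name are assumptions):

```python
from django.db.models.signals import pre_save
from django.dispatch import receiver

# Assumed import path for the Locality model in the Specify 7 backend.
from specifyweb.specify.models import Locality

@receiver(pre_save, sender=Locality)
def default_srclatlongunit(sender, instance, **kwargs):
    # Field name assumed to be the lowercased Django column for SrcLatLongUnit.
    if instance.srclatlongunit is None:
        instance.srclatlongunit = 0
```

If the bulk insertion path skips save signals or save() overrides, a rule like this would not fire, which would match the symptom described above.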
Seems like both of them took around the same amount of time to validate as well: about 30 minutes for the test branch, with main estimated at about the same.
It took around 20 minutes in main and the test branch.
The record set used was a set of about 52,000 localities; I can check with a collection object set as well. If the dataset needs to be bigger for testing, let me know--this seemed big enough!
Fixes #7577
Edit the workbench upload code to batch together rows used in workbench uploads and validation. This is intended to speed up workbench uploads and validations. Still need to continue tuning the batch size that seems to be optimal for workbench uploads, and to adjust how the progress bar updates for the batch upload code.
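As a rough sketch of the batching idea (not the code in this PR; batch_size and the callback signatures are placeholders to be tuned):

```python
from typing import Callable, List, Sequence, TypeVar

Row = TypeVar("Row")

def process_in_batches(
    rows: Sequence[Row],
    process_batch: Callable[[List[Row]], None],
    report_progress: Callable[[int, int], None],
    batch_size: int = 500,
) -> None:
    """Process rows in fixed-size batches, reporting progress per batch."""
    total = len(rows)
    for start in range(0, total, batch_size):
        batch = list(rows[start:start + batch_size])
        process_batch(batch)
        # Update the progress bar once per batch rather than once per row.
        report_progress(min(start + batch_size, total), total)
```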
A good test case that has been causing problems with slow uploads is this one here: https://drive.google.com/file/d/1Mpr_KWMkCY74_yZv_knXiNGeG6TSKCYk/view?usp=drive_link
There are many fields in this file; here is a data mapping I made for testing purposes: https://drive.google.com/file/d/1eo56GKwGbMXV7luGD_SJ24b-ADxFb53X/view?usp=drive_link
You can generate large mineral data sets (which can be easily adapted to other disciplines) using this script: https://github.com/specify/data-management/blob/main/demo-data-generator/generate.py
You can set the range here:
https://github.com/specify/data-management/blob/57959833499a7f838a4ff28444c932ac4c966288/demo-data-generator/generate.py#L50
Generate it just by running:
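Presumably something along the lines of (assuming the script takes no required arguments; check its argument parsing):

```sh
python generate.py
```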
Checklist
self-explanatory (or properly documented)
Testing instructions
Initial Testing:
Further Testing: