Batch Workbench Uploads #7578
base: main
Conversation
Did some profiling to determine which parts of the upload/validate pipeline are taking the most time for each row. Here are the timing results from the first 1,000 rows of the cash upload dataset:
After adding caching for apply_scoping, the new timing results were:
So, that gets us about a 2x improvement, from roughly 500 rows per minute to 1,000 rows per minute. Trying to get a 5x to 10x improvement if possible. Working now on speeding up the sections in the process_row function. It isn't lending itself well to batching, so I'm exploring multiple solutions and adding more fine-grained profiling.
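For reference, a minimal sketch of the kind of caching described above, assuming apply_scoping is deterministic for a given collection and uploadable type; the cache key and function names here are illustrative, not the identifiers actually used in this branch:

```python
# Hypothetical memoization of apply_scoping results; the real change in this
# branch may cache at a different granularity.
from typing import Any, Dict, Tuple

_scoping_cache: Dict[Tuple[int, str], Any] = {}

def apply_scoping_cached(uploadable, collection):
    """Return a scoped uploadable, reusing the result for repeated rows."""
    key = (collection.id, type(uploadable).__name__)
    if key not in _scoping_cache:
        _scoping_cache[key] = uploadable.apply_scoping(collection)
    return _scoping_cache[key]
```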
Added code that can use bulk_insert on applicable rows. The full validation of the cash Workbench dataset of 321,216 records took 50 minutes to complete. So, in terms of rows per minute, we've gone from 500 to 1,000, and now to about 6,000, which is roughly a 10x speed increase on the cash example. Still need to look into which types of rows can be used with bulk_insert, and which should not be, to avoid possible issues. Also looking into implementing bulk_update for other situations. There are also some possible speedups that might work for the binding and matching sections of the code.
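To illustrate the idea (not this PR's actual implementation), here is a hedged sketch of bulk insertion with Django, where model, rows_to_insert, and BATCH_SIZE are placeholder names:

```python
from itertools import islice

BATCH_SIZE = 1000  # batch size is a tuning parameter, not a value from the PR

def bulk_insert(model, rows_to_insert):
    """Insert pre-validated model instances in batches with bulk_create."""
    it = iter(rows_to_insert)
    while True:
        batch = list(islice(it, BATCH_SIZE))
        if not batch:
            break
        # bulk_create bypasses per-instance save() overrides and save signals,
        # which is why only "applicable" rows are safe to insert this way.
        model.objects.bulk_create(batch, batch_size=BATCH_SIZE)
```

Since bulk_create skips per-row save logic, rows whose correctness depends on save-time business rules are the ones that likely cannot use this path.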
This reverts commit cfca16c.
Anyone who has been doing conversions or working with a large workbench validation that takes a while to run: go ahead and try running the validation on this branch, and see if the results are the same and if there is a speedup. If you can, put the workbench data on Google Drive so any issues can be recreated and debugged. Thanks. We don't plan on merging this branch for 7.12; it doesn't seem to work with batch edit, so we'll likely just use this branch internally when needed.
combs-a left a comment
Initial Testing:
- Run the validation process on workbench data that you know takes a decent amount of time to validate. Run the validation on this branch and on the main branch for comparison.
- See if there was a speedup in validation time compared to the main branch.
- See if the validation results look the same as the validation results in the main branch.
I haven't gotten a proper comparison or anything just yet, but did experience this error when uploading localities:
Maybe it's no longer running the business rule that sets SrcLatLongUnit to 0 when it isn't present; the issue is not present on main. Grant linked the locality business rules, so you can check the exact business rule.
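For context, the behavior being tested is roughly the following (a hypothetical illustration using a plain Django pre_save receiver, not Specify's actual business-rule code; the import path and field name are assumptions):

```python
from django.db.models.signals import pre_save
from django.dispatch import receiver

# Assumed import path for the Locality model in the Specify 7 backend.
from specifyweb.specify.models import Locality

@receiver(pre_save, sender=Locality)
def default_srclatlongunit(sender, instance, **kwargs):
    # Field name assumed to be the lowercased Django column for SrcLatLongUnit.
    if instance.srclatlongunit is None:
        instance.srclatlongunit = 0
```

If the bulk insertion path skips save signals or save() overrides, a rule like this would not fire, which would match the symptom described above.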
Seems like both of them took around the same amount of time to validate as well: about 30 minutes for the test branch, with main estimated at about the same.
It took around 20 minutes in main and the test branch.
The record set used was a set of about 52,000 localities; I can check with a collection object set as well. If the dataset needs to be bigger for testing, let me know--this seemed big enough!
Fixes #7577
Edit the workbench upload code to batch together rows used in workbench uploads and validation. This is intended to speed up workbench uploads and validations. Still need to continue tuning the batch size that seems to be optimal for workbench uploads, and to adjust how the progress bar updates for the batch upload code.
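As a rough sketch of the batching idea (not the code in this PR; batch_size and the callback signatures are placeholders to be tuned):

```python
from typing import Callable, List, Sequence, TypeVar

Row = TypeVar("Row")

def process_in_batches(
    rows: Sequence[Row],
    process_batch: Callable[[List[Row]], None],
    report_progress: Callable[[int, int], None],
    batch_size: int = 500,
) -> None:
    """Process rows in fixed-size batches, reporting progress per batch."""
    total = len(rows)
    for start in range(0, total, batch_size):
        batch = list(rows[start:start + batch_size])
        process_batch(batch)
        # Update the progress bar once per batch rather than once per row.
        report_progress(min(start + batch_size, total), total)
```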
A good test case that has been causing problems with slow uploads is this one here: https://drive.google.com/file/d/1Mpr_KWMkCY74_yZv_knXiNGeG6TSKCYk/view?usp=drive_link
There are many fields in this file; here is a data mapping I made for testing purposes: https://drive.google.com/file/d/1eo56GKwGbMXV7luGD_SJ24b-ADxFb53X/view?usp=drive_link
You can generate large mineral data sets (which can be easily adapted to other disciplines) using this script: https://github.com/specify/data-management/blob/main/demo-data-generator/generate.py
You can set the range here:
https://github.com/specify/data-management/blob/57959833499a7f838a4ff28444c932ac4c966288/demo-data-generator/generate.py#L50
Generate it just by running:
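Presumably something along the lines of (assuming the script takes no required arguments; check its argument parsing):

```sh
python generate.py
```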
Checklist
self-explanatory (or properly documented)
Testing instructions
Initial Testing:
Further Testing: