Optimize /api/importSourceLocations #583
One idea is to have our clients calculate hashes for every imported item, store those in VIAL, and provide a supported mechanism for those clients to maintain their own cache of those hashes, so they can avoid sending us data that we already have.
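A minimal sketch of what that client-side cache could look like, assuming a hypothetical HashCache class persisted to a local JSON file and keyed by item ID; none of these names come from VIAL or the ingest code:

import hashlib
import json
from pathlib import Path

def content_hash(item: dict) -> str:
    # Hash a canonical JSON encoding so key order does not matter.
    canonical = json.dumps(item, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class HashCache:
    # Hypothetical client-side cache of the hashes already sent to VIAL.
    def __init__(self, path: Path):
        self.path = path
        self.hashes = json.loads(path.read_text()) if path.exists() else {}

    def should_send(self, item_id: str, item: dict) -> bool:
        # Only send items whose content changed since the last run.
        new_hash = content_hash(item)
        if self.hashes.get(item_id) == new_hash:
            return False
        self.hashes[item_id] = new_hash
        return True

    def save(self) -> None:
        self.path.write_text(json.dumps(self.hashes))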
There is definitely low-hanging fruit within the endpoint itself though (lines 237 to 324 at c33b0d0).
I am going to test out doing content hashes within ingestion by downloading all loaded source locations, calculating a content hash (e.g. sorted keys with no timestamp fields), and then skipping incoming locations with the same hash value. If this plan works then I would like a place to store these content hashes on
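For illustration, that content hash plus the skip check might look something like the sketch below; the TIMESTAMP_FIELDS set, the source_uid key, and the function names are assumptions, not the actual ingest implementation:

import hashlib
import json

# Hypothetical fields to exclude so re-imports of unchanged data hash identically.
TIMESTAMP_FIELDS = {"imported_at", "fetched_at", "last_imported_at"}

def source_location_hash(location: dict) -> str:
    # Serialize with sorted keys, minus timestamp fields, for a stable representation.
    stable = {k: v for k, v in location.items() if k not in TIMESTAMP_FIELDS}
    encoded = json.dumps(stable, sort_keys=True, default=str).encode("utf-8")
    return hashlib.md5(encoded).hexdigest()

def filter_changed(incoming: list, existing_hashes: dict) -> list:
    # Keep only locations whose content differs from what is already loaded.
    return [
        loc for loc in incoming
        if existing_hashes.get(loc["source_uid"]) != source_location_hash(loc)
    ]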
Suggestion: instead of creating content hashes in the application layer, add server logic that leverages Postgres' ETL capability, which basically amounts to:

-- Step 1: temp table with all uploaded data.
-- Modify /api/importSourceLocations to insert
-- into this temp table in a streaming fashion
-- (very low memory overhead).
create temp table tmp_new_data as
...;

begin; -- start transactions as late as possible

-- Step 2: update rows that have changed
update destination_table d
set
  col_a = t.col_a,
  col_b = t.col_b,
  ...
from tmp_new_data t
where
  t.pkey_col_1 = d.pkey_col_1
  and ...
  and (t.col_a, t.col_b, ...) IS DISTINCT FROM (d.col_a, d.col_b, ...);

-- Step 3: insert new rows
insert into destination_table
select t.*
from tmp_new_data t
where not exists (
  select 1
  from destination_table d
  where d.pkey_col_1 = t.pkey_col_1
    and ...
);

commit;

-- Step 4: clean up
drop table tmp_new_data;

Sorry for the "drive-by" design, but I'm happy to collaborate more on this. I've written a bunch of ETL in past lives, and this seems like basically that.
The idea is that Postgres (and RDBMSes in general) does really well when operated in a bulk fashion (as opposed to inserting/updating one row at a time), and the above suggestion does exactly that.
@shashank025 The idea of bulk loading into a temp table and then selectively copying into the main table is a great one, and it would speed up bulk loading in general. We would probably still want content hashes, though, to avoid having to compare each field individually, especially since we would need to do those comparisons on fields inside of a JSON field. If the VIAL authors have time to look into bypassing the ORM to try this temp-table load technique, that would be great. When I have done this in the past I used psycopg2's copy_expert to stream data from a fileobj directly into a temp table, which had low memory overhead, but that may be overkill for the <10MB files we have. In the meantime, here is how it looks when we calculate content hashes in ingestion; I think this solves the broken-load problem well enough for now: CAVaccineInventory/vaccine-feed-ingest#661
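For reference, a sketch of the copy_expert approach; the temp table layout and column names are made up for illustration, not VIAL's actual schema:

import io
import psycopg2

def load_into_temp_table(conn, fileobj: io.TextIOBase) -> None:
    # Stream CSV rows from a file-like object straight into a temp table,
    # keeping memory overhead low regardless of upload size. The temp table
    # only lives for this session, so the follow-up update/insert must run
    # on the same connection.
    with conn.cursor() as cur:
        cur.execute(
            "create temp table tmp_new_data "
            "(source_uid text, name text, import_json jsonb)"
        )
        cur.copy_expert(
            "copy tmp_new_data (source_uid, name, import_json) "
            "from stdin with (format csv)",
            fileobj,
        )

# Usage (connection details are placeholders):
# conn = psycopg2.connect("dbname=vial")
# with open("locations.csv") as f:
#     load_into_temp_table(conn, f)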
The problem with dropping down to raw SQL for this is that we would then need to keep that SQL in sync with further changes we make to the database models; operating via the ORM protects us from having to think about that. Keeping it in sync isn't impossible though: with comprehensive automated tests around this I'm confident we could at least cause easily fixed test failures should the raw SQL optimization be broken by any future changes. It's a complex enough solution that I'd prefer to avoid it unless we urgently need it.
@simonw Ingest implemented content hashes, and that allowed the load to finish successfully in 4.5 hours, which is crucially under the 5 hour deadline! However, this is still too close to the deadline for my comfort, so I think we need further optimization to

Another option to increase performance is that you could skip Pydantic validation of
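If the payload has already been validated on the ingestion side, one way to skip re-validation is pydantic v1's construct(), sketched below with a stand-in model that is not the actual schema used here:

from typing import Optional
from pydantic import BaseModel

class SourceLocation(BaseModel):
    # Stand-in model for illustration only.
    source_uid: str
    name: str
    latitude: Optional[float] = None
    longitude: Optional[float] = None

def parse_trusted(payload: dict) -> SourceLocation:
    # construct() builds the model without running any validators, which is
    # much faster but only safe for data already known to be valid.
    return SourceLocation.construct(**payload)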
OK, sounds like we do need to make some improvements here. I'll look for some low-hanging fruit.
Looks to me like the main problem here is that we're executing hundreds (if not thousands) of tiny 3ms queries.
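As a hedged sketch of the kind of batching that would collapse those per-row queries, here is what it might look like with Django's bulk APIs; the model, field names, and import path are placeholders, not VIAL's actual code:

from django.db import transaction
from core.models import SourceLocation  # placeholder import path

def import_batch(records: list) -> None:
    # One query to fetch every existing row in the batch, instead of one per record.
    existing = SourceLocation.objects.in_bulk(
        [r["source_uid"] for r in records], field_name="source_uid"
    )
    to_create, to_update = [], []
    for record in records:
        obj = existing.get(record["source_uid"])
        if obj is None:
            to_create.append(SourceLocation(**record))
        else:
            obj.name = record["name"]
            to_update.append(obj)
    with transaction.atomic():
        # Two bulk statements per batch instead of hundreds of tiny inserts/updates.
        SourceLocation.objects.bulk_create(to_create, batch_size=500)
        SourceLocation.objects.bulk_update(to_update, ["name"], batch_size=500)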
The "2 minutes for 500 locations" includes encoding and transfer time on ingestion side. I can look into optimizing that like using |
Ah, I also misinterpreted the logs! This means we are loading
It gets hit with 160,000 writes (batched 500 at a time) every six hours and it hasn't had any optimization work done at all.
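Back-of-envelope, using those numbers: 160,000 writes per run split into batches of 500 is 320 API calls every six hours, and if each row triggers even one ~3ms query, that alone is roughly 160,000 x 3 ms, or about 8 minutes of query time per run, before counting round-trips or the ingestion side's encoding and transfer time.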
Discord: https://discord.com/channels/799147121357881364/824443781990187008/843889279763218452 and https://discord.com/channels/799147121357881364/813861006718926848/843934864397828156