Feature #12244

API server bulk transfers for keep-balance collection retrieval

Added by Joshua Randall 8 months ago.

In Issue 9998 (https://dev.arvados.org/issues/9998) we improved the performance of retrieving batches of collections via the API server by close to 50% by eliminating unnecessary steps in the processing of each collection record. However, even after those improvements, retrieving the data directly from postgres remains over 100x faster than going through the API server.

A full cycle of keep-balance on our system today takes ~13h (10M+ collections), while the data can be retrieved from postgres in 6.5 minutes:

# time echo "COPY (select * from collections) TO STDOUT (format text)" | psql -U arvados -w -h localhost arvados_production > /data/tmp/collections.dump

real    6m28.068s
user    0m46.619s
sys     0m45.584s
# wc -l /data/tmp/collections.dump
10075589 /data/tmp/collections.dump
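
As a rough check on these figures, the sketch below computes the implied throughput of each path. It treats the ~13h cycle as exactly 13 hours and attributes the whole cycle to collection retrieval; both are simplifying assumptions, so the results are only indicative.

```python
# Figures taken from the timing output above.
ROWS = 10_075_589                  # lines in collections.dump
DUMP_SECONDS = 6 * 60 + 28.068     # "real 6m28.068s"

# Assumption: "~13h" is taken as exactly 13 hours, all spent on retrieval.
API_SECONDS = 13 * 3600

dump_rate = ROWS / DUMP_SECONDS    # rows per second via COPY
api_rate = ROWS / API_SECONDS      # rows per second via the API server
speedup = API_SECONDS / DUMP_SECONDS

print(f"direct dump: {dump_rate:,.0f} rows/s")
print(f"API server:  {api_rate:,.0f} rows/s")
print(f"ratio:       {speedup:.0f}x")
```

With these inputs the dump works out to roughly 26,000 rows per second against a little over 200 via the API, a ratio of about 120x.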

I'd imagine that going through the API will always carry some overhead compared with a database table dump, but I suspect that a bulk transfer API of some kind that does not involve an ORM could get the keep-balance cycle time down to less than 15m (so, roughly 50x faster than it is now). I think that would be worth doing.
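
To illustrate what consuming such a bulk stream without an ORM might look like on the client side, here is a minimal sketch that parses rows in the text format produced by the COPY command above. The function names `decode_field` and `iter_rows` are hypothetical. The sketch handles only `\N` for NULL and the single-character backslash escapes, not the octal or hex forms. It is not how keep-balance or the API server work today.

```python
import io
from typing import Iterable, Iterator, List, Optional

# Single-character backslash escapes used by COPY's text format.
_ESCAPES = {"b": "\b", "f": "\f", "n": "\n", "r": "\r",
            "t": "\t", "v": "\v", "\\": "\\"}


def decode_field(raw: str) -> Optional[str]:
    """Decode one COPY text-format field; the marker \\N means NULL."""
    if raw == r"\N":
        return None
    if "\\" not in raw:
        return raw  # fast path: nothing to unescape
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\" and i + 1 < len(raw):
            nxt = raw[i + 1]
            # Unknown escapes fall back to the escaped character itself.
            out.append(_ESCAPES.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def iter_rows(stream: Iterable[str]) -> Iterator[List[Optional[str]]]:
    """Yield one decoded row per input line, without loading the whole dump."""
    for line in stream:
        if line.endswith("\n"):
            line = line[:-1]
        # Literal tabs inside values are escaped as \t, so a plain split is safe.
        yield [decode_field(field) for field in line.split("\t")]


# Usage with a made-up two-row sample (not real collection data).
sample = io.StringIO("uuid1\tline one\\nline two\t\\N\n"
                     "uuid2\ta\\tb\tc\\\\d\n")
rows = list(iter_rows(sample))
print(rows)
```

Because rows are yielded one at a time, the same loop could read from a file such as the dump above or from a network stream, keeping memory use independent of the number of collections.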
