API server bulk transfers for keep-balance collection retrieval
In issue 9998 (https://dev.arvados.org/issues/9998) we improved the performance of retrieving batches of collections via the API server by close to 50% by eliminating unnecessary steps in the processing of each collection record. Even after those improvements, however, retrieving the data directly from PostgreSQL remains over 100x faster than going through the API server.
A full cycle of keep-balance on our system today takes ~13h (10M+ collections), while the same data can be dumped straight from PostgreSQL in about 6.5 minutes:
# time echo "COPY (select * from collections) TO STDOUT (format text)" | psql -U arvados -w -h localhost arvados_production > /data/tmp/collections.dump

real    6m28.068s
user    0m46.619s
sys     0m45.584s
# wc -l /data/tmp/collections.dump
10075589 /data/tmp/collections.dump
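For illustration, here is a minimal sketch of how a client could read rows in the COPY text format used for the dump above without any ORM layer. It relies only on the documented format rules (tab-separated columns, \N for NULL, backslash escapes). The function names and the sample row are hypothetical, and octal/hex escapes are deliberately not handled, since COPY TO does not currently emit them.

```python
from typing import List, Optional

# Backslash escapes that PostgreSQL's COPY text format uses for control
# characters and for the backslash itself.
_ESCAPES = {"b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v", "\\": "\\"}


def decode_field(raw: str) -> Optional[str]:
    """Decode one COPY text-format field; the marker \\N alone means NULL."""
    if raw == r"\N":
        return None
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\" and i + 1 < len(raw):
            nxt = raw[i + 1]
            # Any other backslashed character stands for itself.
            out.append(_ESCAPES.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def decode_row(line: str) -> List[Optional[str]]:
    """Split one dump line on tabs (literal tabs in data arrive escaped)."""
    return [decode_field(field) for field in line.rstrip("\n").split("\t")]


# Hypothetical three-column row: a made-up UUID, a NULL, and a multi-line value.
sample = "zzzzz-4zz18-0123456789abcde\t\\N\tline one\\nline two\n"
print(decode_row(sample))
```

Every value comes back as a string or None; mapping columns to names and types is left to the caller, which is part of why this path is so much cheaper than building a model object per record.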
Going through the API will always carry some overhead compared with a raw database table dump, but I suspect that a bulk transfer API of some kind that bypasses the ORM could bring the keep-balance cycle time down to under 15 minutes (roughly 50x faster than it is now). I think that would be worth doing.
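As a quick sanity check on those ratios, the figures quoted in this issue can be plugged into a few lines of Python. The variable names are only for illustration. Note that the ~13h figure is the whole keep-balance cycle rather than collection retrieval alone, so the ratios are approximate.

```python
from datetime import timedelta

# Figures quoted in this issue.
full_cycle = timedelta(hours=13)                  # approximate keep-balance cycle
dump_time = timedelta(minutes=6, seconds=28.068)  # "real" time of the COPY dump
target = timedelta(minutes=15)                    # hoped-for cycle time
rows = 10_075_589                                 # lines in the dump (wc -l)

# Dividing one timedelta by another yields a plain float ratio.
dump_speedup = full_cycle / dump_time
target_speedup = full_cycle / target
rows_per_second = rows / dump_time.total_seconds()

print(f"direct dump vs. current cycle: about {dump_speedup:.0f}x faster")
print(f"15-minute target vs. current cycle: {target_speedup:.0f}x faster")
print(f"dump throughput: about {rows_per_second:,.0f} rows/s")
```

This gives roughly 120x for the direct dump and 52x for the 15-minute target, consistent with the "over 100x" and "50x" estimates above.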