Feature #12244

closed

API server bulk transfers for keep-balance collection retrieval

Added by Joshua Randall over 6 years ago. Updated 7 months ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: API
Target version: -
Story points: -

Description

In Issue 9998 (https://dev.arvados.org/issues/9998) we improved the performance of retrieving batches of collections via the API server by close to 50% by eliminating unnecessary steps in the processing of each collection record. Even after those improvements, however, retrieving the same data directly from PostgreSQL remains over 100x faster than going through the API server.

A full cycle of keep-balance on our system today takes ~13h (10M+ collections), while the data can be retrieved from postgres in 6.5 minutes:

# time echo "COPY (select * from collections) TO STDOUT (format text)" | psql -U arvados -w -h localhost arvados_production > /data/tmp/collections.dump

real    6m28.068s
user    0m46.619s
sys     0m45.584s
# wc -l /data/tmp/collections.dump
10075589 /data/tmp/collections.dump

Going through the API will always carry some overhead compared with a database table dump, but I suspect that a bulk transfer API that bypasses the ORM could bring the keep-balance cycle time down to less than 15 minutes (about 50x faster than it is now). I think that would be worth doing.
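
As a rough check on the figures above, the arithmetic can be worked through in a few lines of Python. This is only a sketch: the constants are the numbers quoted in this issue, the function names are made up for illustration, and treating the full 13 h keep-balance cycle as if it were pure collection-retrieval time is an assumption (the cycle also does other work).

```python
# Figures quoted in this issue. Treating the 13 h keep-balance cycle as
# pure retrieval time is an assumption made only for this comparison.
ROWS = 10_075_589                  # lines in collections.dump (wc -l)
DUMP_SECONDS = 6 * 60 + 28.068     # "real" time of the COPY ... TO STDOUT dump
CYCLE_SECONDS = 13 * 3600          # current full keep-balance cycle (~13 h)
TARGET_SECONDS = 15 * 60           # proposed target cycle time (15 min)


def rows_per_second(rows: int, seconds: float) -> float:
    """Average throughput in collection records per second."""
    return rows / seconds


def speedup(baseline_seconds: float, faster_seconds: float) -> float:
    """How many times shorter the faster duration is than the baseline."""
    return baseline_seconds / faster_seconds


if __name__ == "__main__":
    print(f"direct dump:   {rows_per_second(ROWS, DUMP_SECONDS):,.0f} rows/s")
    print(f"current cycle: {rows_per_second(ROWS, CYCLE_SECONDS):,.0f} rows/s")
    print(f"dump vs cycle:   {speedup(CYCLE_SECONDS, DUMP_SECONDS):.1f}x")
    print(f"target vs cycle: {speedup(CYCLE_SECONDS, TARGET_SECONDS):.1f}x")
```

Under those assumptions the direct dump runs at about 26,000 rows/s against roughly 215 rows/s for the current cycle, a ratio of about 120x, and a 15-minute target corresponds to a 52x speedup, which lines up with the "over 100x" and "50x" estimates in the description.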

Actions #1

Updated by Peter Amstutz about 1 year ago

  • Release set to 60
Actions #2

Updated by Peter Amstutz 7 months ago

  • Status changed from New to Resolved

keep-balance now retrieves and updates records in PostgreSQL directly.

Actions #3

Updated by Peter Amstutz 7 months ago

  • Release deleted (60)