Improve performance of API server when keep-balance retrieves collections
On our system, keep-balance currently takes 7-8 hours to retrieve all ~5.7M collections from an otherwise unloaded API server. When the system is also busy running crunch jobs, it can take closer to 10 hours. Indeed, the latest keep-balance run reported:
2016/09/08 20:53:16 GetCurrentState: took 9h59m9.299166541s
In comparison, all collection data can be dumped from the database (on a busy system) in less than five minutes:
root@humgen-01-01:~# time echo "COPY (select * from collections) TO STDOUT (format text)" | psql -U arvados -W -h localhost arvados_production > /data/tmp/collections.txt
Password for user arvados:
root@humgen-01-01:~# ls -lh /data/tmp/collections.txt
1 root root 13G Sep 9 00:00 /data/tmp/collections.txt
root@humgen-01-01:~# wc -l /data/tmp/collections.txt
That's 35949s for the supported API server / keep-balance interaction, for an underlying operation that can be completed in 279s. There is clearly some overhead involved in performing each SQL query (currently we are retrieving collections in batches of up to 10000), in serialising the results as JSON, and in performing the HTTP interactions for each batch, but I don't think those alone can cause the API server to add 12785% overhead on top of what the database query takes.
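For reference, the overhead figure can be reproduced from the two timings quoted above. This is only a quick arithmetic check; the variable names are made up for illustration.

```python
from datetime import timedelta

# Timings taken from the figures quoted in this ticket.
api_run = timedelta(hours=9, minutes=59, seconds=9.299166541)  # GetCurrentState
db_dump_s = 279  # direct database dump, in seconds

api_run_s = int(api_run.total_seconds())
# Overhead = extra time spent, relative to the raw database dump.
overhead_pct = (api_run_s - db_dump_s) / db_dump_s * 100

print(api_run_s)            # 35949
print(round(overhead_pct))  # 12785
```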
Tom pointed out that keep-balance does not need the authorisation signatures to be added to the manifest block locators, only the block locators themselves - he suggested that eliminating the extra signing operations would alleviate a large portion of the overhead. I think it would be worth trying.
We have 5.7M collections with a total of 6.5M distinct blocks on our system, but the collection manifests contain a total of 161M references to those blocks (i.e. we are making great use of deduplication). It looks like the loops in api/app/models/collection.rb go through each locator in each manifest in turn, signing each reference without any caching (https://github.com/curoverse/arvados/blob/7213d3096cdb5d5e03b559a04f88fcd22a835076/services/api/app/models/collection.rb#L220-L241). If we can skip that step by setting a flag in the query asking for unsigned manifest locators, in our case we'd be able to avoid 161M signing operations.
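To illustrate why the number of references matters more than the number of distinct blocks, here is a toy sketch that counts signing operations with and without a cache. It is not the API server's code: the `Signer` class, the key, the locator names and the HMAC-SHA1 stand-in are all hypothetical, and a real cache might only be valid within a single request if signatures depend on things like the token or an expiry time.

```python
import hashlib
import hmac

SIGNING_KEY = b"hypothetical-key"  # placeholder for illustration only


class Signer:
    """Signs locators and counts how many signing operations actually run."""

    def __init__(self, use_cache: bool):
        self.use_cache = use_cache
        self.cache = {}
        self.operations = 0

    def sign(self, locator: str) -> str:
        if self.use_cache and locator in self.cache:
            return self.cache[locator]
        # Stand-in for one signing operation (made-up output format).
        self.operations += 1
        digest = hmac.new(SIGNING_KEY, locator.encode(), hashlib.sha1).hexdigest()
        signed = f"{locator}:{digest}"
        if self.use_cache:
            self.cache[locator] = signed
        return signed


# Toy manifest data: 8 references to only 3 distinct blocks.
references = ["a", "b", "a", "c", "a", "b", "a", "a"]

uncached = Signer(use_cache=False)
cached = Signer(use_cache=True)
for locator in references:
    uncached.sign(locator)
    cached.sign(locator)

print(uncached.operations)  # 8: one operation per reference
print(cached.operations)    # 3: one operation per distinct block
```

Returning unsigned locators, as suggested above, would bring the count to zero, so it avoids more work than caching would.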
An extra ~8h would be explained by each of the 161M signing operations taking ~0.18ms, which seems within the realm of possibility.
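The per-operation estimate follows directly from the two numbers above; a quick check, with illustrative variable names:

```python
signing_ops = 161_000_000  # block references in all manifests
extra_hours = 8            # approximate unexplained time

# Time per signing operation needed to account for the extra hours.
per_op_ms = extra_hours * 3600 * 1000 / signing_ops
print(f"{per_op_ms:.2f} ms")  # 0.18 ms
```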