Bug #9998

closed

Improve performance of API server when keep-balance retrieves collections

Added by Joshua Randall over 8 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: API
Target version:
Story points: -

Description

On our system, keep-balance currently takes 7-8 hours to retrieve all ~5.7M collections from an otherwise unloaded API server. When the system is also busy running crunch jobs, it can take closer to 10 hours. Indeed, the latest keep-balance run reported:
```
2016/09/08 20:53:16 GetCurrentState: took 9h59m9.299166541s
```

In comparison, all collection data can be dumped from the database (on a busy system) in less than five minutes:
```
root@humgen-01-01:~# time echo "COPY (select * from collections) TO STDOUT (format text)" | psql -U arvados -W -h localhost arvados_production > /data/tmp/collections.txt
Password for user arvados:
real 4m39.610s
user 0m31.554s
sys 0m29.065s
root@humgen-01-01:~# ls -lh /data/tmp/collections.txt
-rw-r--r-- 1 root root 13G Sep 9 00:00 /data/tmp/collections.txt
root@humgen-01-01:~# wc -l /data/tmp/collections.txt
5785865 /data/tmp/collections.txt
```

That's 35949s for the supported API server / keep-balance interaction, versus 279s for the underlying database operation. There is clearly some overhead involved in performing each SQL query (we currently retrieve collections in batches of up to 10000), in serialising the results as JSON, and in performing the HTTP interactions for each batch, but I don't think those alone can cause the API server to add 12785% overhead on top of what the database query takes.
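For reference, the overhead figure can be reproduced from the two timings quoted above. This is only a small arithmetic sketch; the variable names are made up for illustration.

```python
from datetime import timedelta

# Timings copied from the report above.
api_time = timedelta(hours=9, minutes=59, seconds=9.299166541)  # GetCurrentState
db_time = timedelta(minutes=4, seconds=39.610)                  # COPY ... TO STDOUT ("real")

# Truncate to whole seconds, as in the text.
api_s = int(api_time.total_seconds())
db_s = int(db_time.total_seconds())

# Overhead is the extra time relative to the raw database dump.
overhead_pct = (api_s - db_s) / db_s * 100
slowdown = api_s / db_s

print(f"API path: {api_s}s, database dump: {db_s}s")
print(f"overhead: {overhead_pct:.0f}% ({slowdown:.1f}x slower)")
```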

Tom pointed out that keep-balance does not need the authorisation signatures to be added to the manifest block locators, only the block locators themselves - he suggested that eliminating the extra signing operations would alleviate a large portion of the overhead. I think it would be worth trying.

We have 5.7M collections with a total of 6.5M distinct blocks on our system, but the collection manifests contain a total of 161M references to those blocks (i.e. we are making great use of deduplication). It looks like the loops in api/app/models/collection.rb go through each locator in each manifest in turn, signing each reference without any caching (https://github.com/curoverse/arvados/blob/7213d3096cdb5d5e03b559a04f88fcd22a835076/services/api/app/models/collection.rb#L220-L241), so if we can skip that step by setting a flag in the query asking for unsigned manifest locators, we'd be able to avoid 161M signing operations in our case.
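To illustrate why the number of references matters rather than the number of distinct blocks, here is a rough Python sketch of per-reference signing versus returning the manifest unsigned. It is not the API server's code (that is Ruby): the key, the `sign_locator` and `render_manifest` helpers, the `signed` flag, and the shape of the appended hint are all hypothetical, and the real signature format and inputs differ.

```python
import hashlib
import hmac
import re

# Hypothetical key and locator pattern, for illustration only.
SIGNING_KEY = b"not-a-real-key"
LOCATOR_RE = re.compile(r"\b[0-9a-f]{32}\+\d+\b")


def sign_locator(locator):
    """Append an illustrative HMAC-based hint to a single block locator."""
    digest = hmac.new(SIGNING_KEY, locator.encode(), hashlib.sha1).hexdigest()
    return f"{locator}+A{digest}"


def render_manifest(manifest_text, signed=True):
    """Return (manifest, signing_ops); unsigned output does no HMAC work."""
    if not signed:
        return manifest_text, 0
    ops = 0

    def sign_match(match):
        nonlocal ops
        ops += 1  # one HMAC per reference, even for repeated blocks
        return sign_locator(match.group(0))

    return LOCATOR_RE.sub(sign_match, manifest_text), ops


# Two streams sharing one block: three references, two distinct blocks.
manifest = (
    ". acbd18db4cc2f85cedef654fccc4a4d8+3 37b51d194a7513e45b56f6524f2d51f2+3 0:6:a\n"
    "./copy acbd18db4cc2f85cedef654fccc4a4d8+3 0:3:b\n"
)
signed_text, signed_ops = render_manifest(manifest, signed=True)
unsigned_text, unsigned_ops = render_manifest(manifest, signed=False)
distinct_blocks = set(LOCATOR_RE.findall(manifest))

print(signed_ops, unsigned_ops, len(distinct_blocks))
```

In this toy manifest the signed path performs one HMAC per reference, while the unsigned path returns the text untouched, which is all keep-balance would need.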

An extra ~8h would be explained by each of the 161M signing operations taking ~0.18ms, which seems within the realm of possibility.
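The estimate can be checked in both directions: the total time implied by 0.18ms per signature, and the per-reference cost implied by the measured gap between the two timings above. This is a back-of-the-envelope sketch that attributes the whole gap to signing, which is an assumption, not a measurement.

```python
references = 161_000_000   # block references across all manifests (from the report)
per_signature_ms = 0.18    # assumed cost of one signing operation

# Total signing time if every reference costs per_signature_ms.
total_s = references * per_signature_ms / 1000
print(f"signing total: {total_s:.0f}s = {total_s / 3600:.2f}h")

# Conversely: per-reference cost if the whole measured gap were signing.
gap_s = 35949 - 279        # API retrieval time minus raw database dump time
implied_ms = gap_s / references * 1000
print(f"implied cost per reference: {implied_ms:.2f}ms")
```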


Files

9998-collection-retrieval-time-comparison.png (133 KB): Timings from trials with and without new collection listing performance improvements. Joshua Randall, 02/17/2017 12:16 PM

Subtasks 1 (0 open, 1 closed)

Task #10950: Review 9998-no-count-items-available (Resolved, Radhika Chippada, 09/08/2016)

Related issues 3 (0 open, 3 closed)

Related to Arvados - Idea #6830: [API] [keep-balance] Option to return unsigned manifests from collections#index (Resolved)
Related to Arvados - Bug #10517: [CLI] Default return fields should be consistent across SDKs (Resolved, Joshua Randall, 11/10/2016)
Related to Arvados - Bug #10521: [SDKs] [CLI] "arv collection list" retrieves manifest_text even if not explicitly selected (Duplicate, 11/11/2016)