Idea #7824
Updated by Brett Smith about 9 years ago
h2. Original report

In 1.5 hrs, 8MiB of a 55MiB file was downloaded using the command:

<pre>
arv keep get 215dd32873bfa002aa0387c6794e4b2c+54081534/tile.csv .
</pre>

Running top on the computer where the "arv keep get" command is running shows:

<pre>
top - 19:47:07 up 2 days, 9:09, 8 users,  load average: 1.12, 1.26, 1.32
Tasks: 223 total,   3 running, 217 sleeping,   0 stopped,   3 zombie
%Cpu(s): 43.5 us,  8.7 sy,  0.0 ni, 47.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  15535256 total, 12281116 used,  3254140 free,  1069760 buffers
KiB Swap: 15929340 total,   221892 used, 15707448 free.  5467732 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
14366 sguthrie  20   0 2498672 2.173g   7204 R 100.0 14.7  98:02.16 arv-get
</pre>

Downloads of this collection from Workbench time out before the user can choose where to download the file.

Story #7729 requires multiple downloads from this qr1hi collection (qr1hi-4zz18-wuld8y0z7qluw00) and from others with similarly large manifests. To unblock #7729 I would need one of:

* A recipe that allows a user to alter the manifest so that it is well behaved
* Faster downloads from collections with very large manifests

Update by Ward:

I investigated a bit while this was ongoing. There was no discernible extra load on keepproxy, the API server, or Postgres while Sally's download was in progress. But when I tried to run the command locally, after a while I saw arv-get consume 100% CPU (one core) and peak at 3GiB of RAM (resident!) until I killed it.

h2. Fix

Update arv-get to fetch files from collections using the Python file API, which is better optimized in the SDK than the old CollectionReader API. See the code in note 3 (#7824-3) for the basic gist.
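
For reference, a minimal sketch of the file-API approach (this is not the code from note 3; it assumes the SDK's @CollectionReader.open()@ file interface, and the collection locator and filename are taken from the report above):

<pre>
import arvados.collection

# Sketch only -- not the code from note 3. Locator and filename are from
# the report above.
LOCATOR = '215dd32873bfa002aa0387c6794e4b2c+54081534'

reader = arvados.collection.CollectionReader(LOCATOR)

# open() returns a file-like object that fetches Keep blocks on demand,
# so data is streamed a chunk at a time rather than held in memory.
with reader.open('tile.csv') as src, open('tile.csv', 'w') as dst:
    while True:
        data = src.read(1024 * 1024)  # stream in 1 MiB chunks
        if not data:
            break
        dst.write(data)
</pre>

The intent, per the description above, is that each read goes through the better-optimized file API code path instead of the old CollectionReader interface.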