Project

General

Profile

Actions

Bug #18941

closed

arv-mount/python SDK performance should be better

Added by Peter Amstutz almost 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
FUSE
Target version:
Story points:
-
Release relationship:
Auto

Subtasks 1 (0 open1 closed)

Task #18945: Review 18941-arv-prefetchResolvedPeter Amstutz03/30/2022Actions

Related issues 1 (1 open0 closed)

Related to Arvados - Bug #18964: Write better prefetch testsNewActions
Actions #1

Updated by Peter Amstutz almost 3 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Peter Amstutz almost 3 years ago

  • Subject changed from arv-mount performance should be better to arv-mount/python SDK performance should be better
Actions #4

Updated by Peter Amstutz almost 3 years ago

  • Release set to 46

I'm curious how the performance improves with a local keepstore server

Goofys uses a 400 MB readahead buffer with 20 MB chunks.

Actions #5

Updated by Peter Amstutz almost 3 years ago

18941-arv-prefetch @ d3051d45df5ee760a05a84bf4d385799c4326477

developer-run-tests: #3004

Found bugs in the prefetch logic:

When checking to see if a block was already in cache or should be prefetched it was:

  • using get_from_cache() which only returns true if the block is fully downloaded, but what we care about is whether there is an in-flight GET request or not.
  • using the signed identifier, but blocks are keyed on the bare md5 sum.

So, it would always think the block was not yet in cache, issuing numerous prefetch requests for the same block over and over, creating a traffic jam where all the prefetch threads were waiting on the same block instead of skipping duplicates and fetching the next required block.

Fixes:

  • Add a special "prefetch" option for KeepClient.get() which (a) starts downloading like normal if the block is not cached or (b) returns immediately if the block already has a cache entry
  • This ensures the prefetch requests (a) only initiate downloads for blocks that are not yet in flight (b) does not wait for in-flight requests to complete
  • Also removed the separate check to see if the block was in the cache; this could race with other threads and still result in more than one thread waiting on the same block fetch.
  • the number of prefetch 'get' threads is now settable, and the prefetch window size is based on the number of prefetch threads
  • arv-get has new --threads option, uses 4 threads by default
  • arv-mount sets the number of prefetch threads based on the keep cache size
  • cleanup: removed an unused "is_cached" method (which could not possibly have ever been used because it contains a fatal typo)
Actions #6

Updated by Peter Amstutz almost 3 years ago

1.7 GiB test file

Time to read whole file
arv-get 0m19.980s
arv-mount, default 256 MiB cache, local keepstore 0m25.396s
arv-mount, increased 384 MiB cache, local keepstore 0m22.354s

time arv-get --threads 4 pirca-4zz18-qwyvdfzx68qok1y/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex /dev/null

time dd if=keep/by_id/pirca-4zz18-qwyvdfzx68qok1y/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex of=/dev/nul

Reading the file includes loading and parsing the manifest and creating the Collection object.

With default settings, arv-mount is using 3 threads, but arv-get is using 4. They both use (by default) a 256 MiB cache, the difference is that arv-get is guaranteed to be doing a sequential read, whereas arv-mount is being slightly more conservative to account for other read patterns. We should consider adjusting the clusterwide default value "Container.DefaultKeepCacheRAM: 268435456" (256 Mib) up to 402653184 (384 MiB).

Ward observed that there seems to be a cache warming effect on the S3 side. I'm seeing this too. Reading a file, unmounting it, re-mounting it, then reading it again, the measured performance goes from 130 MB/s (first time) to 180 MB/s (subsequent trials).

Actions #7

Updated by Peter Amstutz almost 3 years ago

  • Assigned To set to Peter Amstutz
Actions #8

Updated by Peter Amstutz almost 3 years ago

  • Target version changed from 2022-03-30 Sprint to 2022-04-13 Sprint
Actions #9

Updated by Peter Amstutz almost 3 years ago

I made one small change increasing max threads from 6 to 7 0a9d4fa5043bd7291611c41588bdd3f0b70ede44

Actions #10

Updated by Lucas Di Pentima almost 3 years ago

Some comments & questions:

  • "arv keep get" documentation may need updating at sdk/cli/subcommands.html to reflect the new parameter.
  • I think it would be good to have proper testing for this fix, not having it allowed this bug to exist for so long. OTOH if we need to ship this ASAP, it could be done in a separate branch, wdyt?
  • As talked at spring review, I think using the CPU core number as a factor for capping the max number of threads would be convenient. Maybe doing some manual tests on a big node to validate this would be desirable? If you have already done this test and it just works on arv-get, how about computing the default get threads depending on the cpu cores?

The rest LGTM

Actions #11

Updated by Peter Amstutz almost 3 years ago

Lucas Di Pentima wrote:

Some comments & questions:

  • "arv keep get" documentation may need updating at sdk/cli/subcommands.html to reflect the new parameter.

Will do.

  • I think it would be good to have proper testing for this fix, not having it allowed this bug to exist for so long. OTOH if we need to ship this ASAP, it could be done in a separate branch, wdyt?

Yes, we need to ship it.

It's very hard to write a test for. There was already a test that the very basic prefetch behavior was happening, the problem is KeepClient.get() was stubbed out. Since the core of the issue was multiple threads waiting on KeepClient.get() of the same block, the use of a stub would have still hidden the problem. I'll have to think about it.

  • As talked at spring review, I think using the CPU core number as a factor for capping the max number of threads would be convenient. Maybe doing some manual tests on a big node to validate this would be desirable? If you have already done this test and it just works on arv-get, how about computing the default get threads depending on the cpu cores?

Right now I don't have enough information. My theory is that because of the Python global interpreter lock and the additional work that the FUSE driver does (compared to arv-get), twice as many cores would have twice the lock contention, which would be worse, not better. I don't want to leave it uncapped or use a formula for which I don't have any evidence that it'll actually make it better.

The rest LGTM

Thanks.

Actions #12

Updated by Peter Amstutz almost 3 years ago

Benchmark using dd if=(25 GB keep file) of=/dev/null

2.3.3

https://workbench.pirca.arvadosapi.com/container_requests/pirca-xvhdp-amqvijqbdoszk83#Log

2022-03-30T20:47:17.361472192Z 26691600151 bytes (27 GB, 25 GiB) copied, 666.32 s, 40.1 MB/s

2.4.0~rc1

On a 2 core node:

https://workbench.pirca.arvadosapi.com/container_requests/pirca-xvhdp-a99km88q375xpiy#Log

2022-03-30T21:02:42.039596608Z 26691600151 bytes (27 GB, 25 GiB) copied, 222.505 s, 120 MB/s

On a 4 core node:

https://workbench.pirca.arvadosapi.com/container_requests/pirca-xvhdp-yx8wmx0qddaqxmq#Log

2022-03-30T21:06:55.824180944Z 26691600151 bytes (27 GB, 25 GiB) copied, 152.218 s, 175 MB/s

On a 8 core node:

https://workbench.pirca.arvadosapi.com/container_requests/pirca-xvhdp-30vfuzz3x60j3if#Log

2022-03-30T21:12:09.440730978Z 26691600151 bytes (27 GB, 25 GiB) copied, 137.416 s, 194 MB/s

Actions #13

Updated by Peter Amstutz almost 3 years ago

  • Related to Bug #18964: Write better prefetch tests added
Actions #14

Updated by Peter Amstutz almost 3 years ago

Added separate story for testing #18964

Actions #15

Updated by Peter Amstutz almost 3 years ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF