Idea #3640
[SDKs] Add runtime option to SDKs (esp Python and arv-mount) to use a filesystem directory block cache as an alternative to RAM cache.
Status: Closed
Priority: Normal
Assigned To: -
Category: Keep
Target version: -
Start date:
Due date:
Story points: 2.0
Description
Background:
arv-mount has a block cache, which improves performance when the same blocks are read multiple times. However:
- Currently a new arv-mount process is started for each Crunch task execution. This means tasks don't share a cache, even if they're running at the same time.
- In the common case where multiple crunch tasks run at the same time and use the same data, we have multiple arv-mount processes each retrieving and caching its own copy of the same data blocks.
- Use large swap on worker nodes (preferably SSD). (We already do this for other reasons.)
- Set up a large tmpfs on worker nodes and use it as crunch job scratch space. (This already gets cleared at the beginning of a job to avoid leakage between jobs/users.)
- Use a directory in that tmpfs as an arv-mount cache. This makes it feasible to use a large cache size, and makes it easy to share the cache between multiple arv-mount processes.
- Rely on unix permissions for cache privacy. (Warn if the cache dir's mode & 0007 != 0, but go ahead anyway: there will be cases where that would be useful and not dangerous.)
- Use flock() to avoid races and duplicated effort. (If arv-mount 1 is writing a block to the cache, then arv-mount 2 should wait for arv-mount 1 to finish, then read from the cache, rather than fetch its own copy.)
- Do not clean up cache dir at start/exit, at least by default (the general idea is to share with past/future arv-mount procs). An optional --cache-clear-atexit flag would be nice to have.
- Measuring/limiting cache size could be interesting
- Delete & replace upon finding a corrupt/truncated cache entry
- The default Keep mount on shell nodes should use a filesystem cache, assuming there is an appropriate filesystem for it (i.e., something faster than network: tmpfs, SSD, or at least a disk with async/barriers=0).
- crunch-job should create a per-job temp dir on each node during the "install" phase, and point all arv-mount processes to it.
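The permission warning, flock() coordination, and corrupt-entry replacement described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the arv-mount implementation: check_cache_dir, get_block, and the fetch callable are hypothetical names, and the sketch assumes each cache file is named after the MD5 hex digest of its block so an entry can be verified on read.

```python
import fcntl
import hashlib
import os
import stat
import warnings


def check_cache_dir(cache_dir):
    """Warn, but carry on, if the cache dir grants any access to 'other'."""
    mode = stat.S_IMODE(os.stat(cache_dir).st_mode)
    if mode & 0o007 != 0:
        warnings.warn("cache dir %s is accessible to other users (mode %o)"
                      % (cache_dir, mode))
        return False
    return True


def get_block(cache_dir, md5sum, fetch):
    """Return block data, fetching it at most once across cooperating processes.

    fetch is a hypothetical callable (md5sum -> bytes) standing in for a
    Keep client; it is an assumption of this sketch.
    """
    path = os.path.join(cache_dir, md5sum)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    with os.fdopen(fd, "r+b") as f:
        # A second process blocks here until the first has finished writing,
        # then finds a valid entry instead of fetching its own copy.
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            data = f.read()
            if hashlib.md5(data).hexdigest() == md5sum:
                return data  # cache hit
            # Missing, truncated, or corrupt entry: replace it.
            data = fetch(md5sum)
            f.seek(0)
            f.truncate()
            f.write(data)
            f.flush()
            return data
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

For simplicity the sketch takes an exclusive lock even for reads, which serializes readers of the same block; a shared lock for the read path would be a natural refinement.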
Updated by Tom Clegg over 10 years ago
- Description updated (diff)
- Category set to Keep
Updated by Tom Clegg over 10 years ago
- Target version set to Arvados Future Sprints
Updated by Tom Clegg over 10 years ago
- Subject changed from [FUSE] Add runtime option to use a filesystem directory block cache as an alternative to RAM cache. to [FUSE] Add runtime option to arv-mount to use a filesystem directory block cache as an alternative to RAM cache.
Updated by Tom Clegg over 10 years ago
- Subject changed from [FUSE] Add runtime option to arv-mount to use a filesystem directory block cache as an alternative to RAM cache. to [FUSE] Add runtime option to SDKs (esp Python and arv-mount) to use a filesystem directory block cache as an alternative to RAM cache.
Updated by Tom Clegg over 10 years ago
- Subject changed from [FUSE] Add runtime option to SDKs (esp Python and arv-mount) to use a filesystem directory block cache as an alternative to RAM cache. to [SDKs] Add runtime option to SDKs (esp Python and arv-mount) to use a filesystem directory block cache as an alternative to RAM cache.
Updated by Ward Vandewege over 3 years ago
- Target version deleted (Arvados Future Sprints)
Updated by Peter Amstutz almost 3 years ago
- Has duplicate Feature #18842: Local disk keep cache for Python SDK/arv-mount added
Updated by Peter Amstutz over 2 years ago
Implementation brainstorm.
Build this feature around mmap()
When fetching a block, first check the memory cache, then check the disk cache, then fetch it from Keep
- When fetching a block from Keep, keep it in memory and start asynchronously writing it out to disk
- We want to be able to serve reads immediately without waiting for the disk cache machinery
- Once it has been written to disk it can be ejected from the memory cache
- When we find a block in the disk cache, open it and use mmap(); this gives us something that behaves like a memory buffer
- Separately keep track of open file descriptors and close ones that haven't been used recently
- Separately keep track of space used by blocks on disk and delete least recently used ones
- existing code for reassembling files from blocks mostly doesn't have to change
- avoid making a read() syscall in the happy case (no page fault)
- able to leverage the kernel's filesystem cache to balance between user process memory & cache memory
- the file that's just been written and then re-opened might even still be in the file system cache, which may even avoid blocking on disk activity
- can have a much larger default cache, users don't have to think about the Arvados cache
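A minimal Python sketch of the lookup order and mmap() idea from this brainstorm might look like the following. It is an illustration under assumptions, not the SDK's code: DiskBlockCache, fetch, flush, and close_idle are hypothetical names, the write to disk is triggered explicitly rather than asynchronously, and recency is tracked in-process with OrderedDict.

```python
import mmap
import os
from collections import OrderedDict


class DiskBlockCache:
    """Hypothetical two-tier cache: memory, then mmap'd files, then Keep."""

    def __init__(self, cache_dir, fetch, max_disk_bytes):
        self.cache_dir = cache_dir
        self.fetch = fetch               # assumed callable: locator -> bytes
        self.max_disk_bytes = max_disk_bytes
        self.memory = {}                 # blocks not yet written to disk
        self.mapped = OrderedDict()      # locator -> mmap, oldest first
        self.on_disk = OrderedDict()     # locator -> size, oldest first

    def _path(self, locator):
        return os.path.join(self.cache_dir, locator)

    def get(self, locator):
        if locator in self.memory:               # 1. memory cache
            return self.memory[locator]
        if locator in self.mapped:               # 2. disk cache, already mapped
            self.mapped.move_to_end(locator)
            self.on_disk.move_to_end(locator)
            return self.mapped[locator]
        if os.path.exists(self._path(locator)):  # 2. disk cache, not yet mapped
            return self._map(locator)
        data = self.fetch(locator)               # 3. fetch from Keep
        self.memory[locator] = data              # serve reads immediately
        return data

    def flush(self, locator):
        """Write a block to disk, then eject it from the memory cache."""
        data = self.memory[locator]
        tmp = self._path(locator) + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
        os.replace(tmp, self._path(locator))     # readers never see a partial file
        self.on_disk[locator] = len(data)
        self.on_disk.move_to_end(locator)
        del self.memory[locator]
        self._evict()

    def _map(self, locator):
        path = self._path(locator)
        with open(path, "rb") as f:
            # mmap keeps its own duplicate of the descriptor, so the file
            # object can be closed. mmap raises ValueError for an empty file.
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        self.mapped[locator] = m
        self.on_disk[locator] = os.path.getsize(path)
        self.on_disk.move_to_end(locator)
        return m

    def _evict(self):
        """Delete least recently used blocks until under the size limit."""
        while (sum(self.on_disk.values()) > self.max_disk_bytes
               and len(self.on_disk) > 1):
            locator, _ = self.on_disk.popitem(last=False)
            # Drop our reference without closing: an existing mapping stays
            # readable on Unix after the file is unlinked.
            self.mapped.pop(locator, None)
            os.unlink(self._path(locator))

    def close_idle(self, max_open):
        """Close least recently used mappings beyond max_open.

        A real implementation would need to make sure no reader still
        holds the mapping before closing it.
        """
        while len(self.mapped) > max_open:
            _, m = self.mapped.popitem(last=False)
            m.close()
```

Because both bytes and mmap objects support slicing, code that reassembles files from blocks can treat either return value as a buffer, which matches the point above that existing reassembly code mostly would not have to change.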