Bug #18376

[keepstore] Avoid long-lived readdirent cookies in filesystem driver

Added by Tom Clegg 6 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Keep
Target version:
Start date: 11/16/2021
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

We have seen the current implementation of IndexTo fail (error reading "/data/keep/13b": readdirent /data/keep/13b: errno 523) when the underlying filesystem is NFS and the indexing operation takes over 4 hours. (Errno 523 is EBADCOOKIE in NFS.)

We can avoid relying unnecessarily on long-lived readdirent cookies by:
  • doing open/readdir/close on the top-level directory, then open/readdir/close on each subdirectory (the current implementation indexes each subdirectory before calling readdirent on the top-level directory again to get the next subdir)
  • calling ReadDir() to get DirEntry structs as quickly as possible, then calling lstat() to get sizes (the current implementation uses Readdir(), which interleaves calls to lstat() and readdirent())

Subtasks

Task #18386: Review 18376-nfs-readdirent (Resolved, Lucas Di Pentima)

Task #18473: review 18376-nfs-readdirent (Resolved, Tom Clegg)


Related issues

Related to Arvados - Bug #18547: [keep-balance] Avoid redundant indexing when multiple keepstore servers use a single NFS mount (Resolved, 12/06/2021)

Blocks Arvados - Story #18518: Release Arvados 2.3.2 (Resolved, 12/06/2021)

History

#1 Updated by Tom Clegg 6 months ago

  • Description updated

#3 Updated by Lucas Di Pentima 6 months ago

This LGTM, thanks.

#4 Updated by Tom Clegg 6 months ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados-private:commit:arvados|153d9954cbe21a0e98bf5cf364898e2bc10fcabd.

#5 Updated by Ward Vandewege 6 months ago

  • Release set to 45

#6 Updated by Tom Clegg 6 months ago

  • Status changed from Resolved to In Progress

Problem persists. Maybe we need a retry loop to get through busy periods?

18376-nfs-readdirent @ f7278a4238a687ba4b8203417133bc9add5e166b -- developer-run-tests: #2808

#7 Updated by Peter Amstutz 6 months ago

  • Release changed from 45 to 48

#8 Updated by Tom Clegg 6 months ago

  • Target version changed from 2021-11-24 sprint to 2021-12-08 sprint

#9 Updated by Tom Clegg 6 months ago

Likelihood of hitting this error appears to vary with load, so we might stop seeing it when #18547 is fixed. In the cluster in question, multiple keepstore processes on different nodes get directory indexes on the same NFS volume all at once.

#10 Updated by Tom Clegg 6 months ago

  • Related to Bug #18547: [keep-balance] Avoid redundant indexing when multiple keepstore servers use a single NFS mount added

#11 Updated by Lucas Di Pentima 6 months ago

Retry loop at f7278a4 LGTM. Thanks.

#12 Updated by Peter Amstutz 5 months ago

#13 Updated by Tom Clegg 5 months ago

  • Status changed from In Progress to Resolved

cherry-picked f7278a423 onto 2.3-dev as b008c44ea
