Bug #18376

closed

[keepstore] Avoid long-lived readdirent cookies in filesystem driver

Added by Tom Clegg about 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Keep
Target version:
Story points: -
Release relationship: Auto

Description

We have seen the current implementation of IndexTo fail (error reading "/data/keep/13b": readdirent /data/keep/13b: errno 523) when the underlying filesystem is NFS and the indexing operation takes over 4 hours. (Errno 523 is EBADCOOKIE in NFS.)

We can avoid relying unnecessarily on long-lived readdirent cookies by
  • doing open/readdir/close on the top-level directory, then open/readdir/close on each subdirectory (the current implementation indexes each subdirectory before calling readdirent on the top-level directory to get the next subdir)
  • calling ReadDir() to get DirEntry values as quickly as possible, then calling lstat() to get sizes (the current implementation uses Readdir(), which interleaves calls to lstat() and readdirent())

Subtasks 2 (0 open, 2 closed)

Task #18386: Review 18376-nfs-readdirent (Resolved, Lucas Di Pentima, 11/16/2021)
Task #18473: review 18376-nfs-readdirent (Resolved, Tom Clegg, 11/16/2021)

Related issues

Related to Arvados - Bug #18547: [keep-balance] Avoid redundant indexing when multiple keepstore servers use a single NFS mount (Resolved, Tom Clegg, 12/06/2021)
Blocks Arvados - Idea #18518: Release Arvados 2.3.2 (Resolved, Peter Amstutz, 12/06/2021)
Actions #1

Updated by Tom Clegg about 3 years ago

  • Description updated (diff)
Actions #3

Updated by Lucas Di Pentima about 3 years ago

This LGTM, thanks.

Actions #4

Updated by Tom Clegg about 3 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados-private:commit:arvados|153d9954cbe21a0e98bf5cf364898e2bc10fcabd.

Actions #5

Updated by Ward Vandewege about 3 years ago

  • Release set to 45
Actions #6

Updated by Tom Clegg about 3 years ago

  • Status changed from Resolved to In Progress

Problem persists. Maybe we need a retry loop to get through busy periods?

18376-nfs-readdirent @ f7278a4238a687ba4b8203417133bc9add5e166b -- developer-run-tests: #2808

Actions #7

Updated by Peter Amstutz almost 3 years ago

  • Release changed from 45 to 48
Actions #8

Updated by Tom Clegg almost 3 years ago

  • Target version changed from 2021-11-24 sprint to 2021-12-08 sprint
Actions #9

Updated by Tom Clegg almost 3 years ago

Likelihood of hitting this error appears to vary with load, so we might stop seeing it when #18547 is fixed. In the cluster in question, multiple keepstore processes on different nodes get directory indexes on the same NFS volume all at once.

Actions #10

Updated by Tom Clegg almost 3 years ago

  • Related to Bug #18547: [keep-balance] Avoid redundant indexing when multiple keepstore servers use a single NFS mount added
Actions #11

Updated by Lucas Di Pentima almost 3 years ago

Retry loop at f7278a4 LGTM. Thanks.

Actions #12

Updated by Peter Amstutz almost 3 years ago

Actions #13

Updated by Tom Clegg almost 3 years ago

  • Status changed from In Progress to Resolved

cherry-picked f7278a423 onto 2.3-dev as b008c44ea
