Project

General

Profile

Actions

Bug #9732

open

[FUSE] performance issues stat() in python

Added by Nico César over 7 years ago. Updated about 2 months ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
Story points:
-
Release:
Release relationship:
Auto

Description

Hi Curoverse,

we are seeing some performance issues accessing arv fuser mounts with python

from one of our users:

I noticed that - when mounting arvados keep - running os.stat is very slow.
I am running a workflow that that has 18 BAM files as input, each about 60Gb,
and a reference sequence of 3Gb.

The time it takes to build a workflow is 6 mins, almost all of this time is spent in calls to os.stat. To give an example:

  1. 2016-08-02 13:46:18,118 DEBUG DB: collected 0.368199110031 /local/scratch/benchmark/ctmpBa9g4h/shared/Open_Access/xxxxxxxxxxx.fa
  2. 2016-08-02 13:46:24,087 DEBUG DB: collected 5.96899294853 /local/scratch/benchmark/ctmpBa9g4h/shared/Open_Access/xxxxxxxxxxxxxxxxxxx.bam
  3. 2016-08-02 13:46:25,408 DEBUG DB: collected 1.32038497925 /local/scratch/benchmark/ctmpBa9g4h/shared/Open_Access/xxxxxxxxxxxxxxxxxxxx.fa
  4. 2016-08-02 13:46:37,359 DEBUG DB: collected 11.9504730701 /local/scratch/benchmark/ctmpBa9g4h/shared/Open_Access/xxxxxxxxxxxxxxxxxxxx.bam
  5. 2016-08-02 13:46:38,718 DEBUG DB: collected 1.35802483559 /local/scratch/benchmark/ctmpBa9g4h/shared/Open_Access/xxxxxxxxxxxxxxxxxxxx.fa
  6. 2016-08-02 13:46:50,784 DEBUG DB: collected 12.065251112 /local/scratch/benchmark/ctmpBa9g4h/shared/Open_Access/xxxxxxxxxxxxxxxxx.bam

Each log message contains a number and a filename. The number is the
time in seconds of the os.stat call to the filename. From these numbers I deduce:

1. The time for os.stat is proportional to the file size (compare bam versus fasta file)
2. The time for os.stat does not reduce (see repeated calls to fasta file)

This is on london. I am already caching within python all the os.stat calls.
It seems to me that stat() from the command line is not similarly affected.
The issue might very well be in the python posix module.

Do you - or the people at arvados - have any advice? I expect this could be a generic issue for anyone trying to build workflows from data in arvados keep and using file stamps.

As the user pointed out it could very well be an issue with python, though I thought it was worth checking with you guys,

kind regards,

Roeland


Related issues

Related to Arvados - Bug #10629: [FUSE] slow enumerating files by collection uuidResolvedPeter Amstutz11/24/2016Actions
Actions

Also available in: Atom PDF