Idea #21942

Poor performance when a collection consists mostly of small slices of many different large blocks

Added by Peter Amstutz 4 days ago. Updated 4 days ago.

Status: New
Priority: Normal
Assigned To: -
Category: Keep
Target version:
Start date:
Due date:
Story points: -

Description

User has a collection which consists of several thousand files that are 100-200 bytes each.

Each file was sourced from a different workflow output collection.

When these files were created, small file packing was applied; as a result, each of these 100-200 byte files is embedded in a data block that is 50-60 MiB.

Consequently, going through this collection and reading each file is much slower than expected, because behind the scenes Arvados must fetch a 50-60 MiB block to extract the 100-200 byte slice that makes up the file.
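
For illustration, a manifest entry for one of these files might look like the following (the locator, offsets, and file name are hypothetical). The file covers only 152 bytes of a block that is roughly 58 MiB, yet reading it means fetching the entire block:

    . aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa+60817409 52428800:152:sample_0001.txt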

Think about ways to behave more efficiently where < 50% of a given block is referenced by a collection.
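
As a rough way to quantify this, the manifest itself can be scanned to see what fraction of each referenced block is actually covered by file segments. The sketch below (plain Python, no SDK required; the function name block_usage is illustrative) parses manifest text and reports per-block utilization. It ignores the possibility of overlapping segments from deduplicated content, so treat the numbers as approximate:

    import re
    import sys
    from collections import defaultdict

    LOCATOR_RE = re.compile(r'^([0-9a-f]{32})\+(\d+)')

    def block_usage(manifest_text):
        """Return {block_hash: (referenced_bytes, block_size)} for a Keep manifest."""
        usage = defaultdict(int)   # block hash -> bytes covered by file segments
        sizes = {}                 # block hash -> block size from its locator
        for line in manifest_text.splitlines():
            tokens = line.split()
            if not tokens:
                continue
            # tokens[0] is the stream name, then block locators, then pos:len:name file tokens
            blocks = []            # (hash, size, offset of the block within the stream)
            offset = 0
            file_tokens = []
            for tok in tokens[1:]:
                m = LOCATOR_RE.match(tok)
                if m and not file_tokens:
                    block_hash, size = m.group(1), int(m.group(2))
                    blocks.append((block_hash, size, offset))
                    sizes[block_hash] = size
                    offset += size
                else:
                    file_tokens.append(tok)
            for ft in file_tokens:
                pos, length = (int(x) for x in ft.split(':', 2)[:2])
                end = pos + length
                # Attribute this file segment to every block it overlaps.
                for block_hash, size, boff in blocks:
                    lo, hi = max(pos, boff), min(end, boff + size)
                    if lo < hi:
                        usage[block_hash] += hi - lo
        return {h: (usage[h], sizes[h]) for h in sizes}

    if __name__ == '__main__':
        for block_hash, (used, size) in sorted(block_usage(sys.stdin.read()).items()):
            pct = 100.0 * used / size if size else 0.0
            print(f'{block_hash}: {used}/{size} bytes referenced ({pct:.1f}%)')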

A couple ideas:

  1. Support range requests when fetching blocks from Keep (interacts poorly with caching, though)
  2. When constructing/saving a collection that would end up like this, do an "optimize" pass to rewrite/repack the files (see the sketch after this list)
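
A minimal sketch of the second idea, assuming the Python SDK's mapping-style Collection interface (items() yielding Subcollection objects for directories) and binary open() modes; iter_file_paths and repack_collection are illustrative names, not existing SDK functions. Because the copied bytes go through the destination collection's write buffer, they get packed into fresh blocks containing only data this collection actually uses. A real implementation would also preserve properties and ownership, and would only repack when utilization is low (e.g. the < 50% threshold above):

    import sys
    import arvados
    import arvados.collection

    def iter_file_paths(coll, prefix=''):
        """Yield the path of every file in a collection, recursing into directories."""
        for name, item in coll.items():
            path = prefix + name
            if isinstance(item, arvados.collection.Subcollection):
                for sub_path in iter_file_paths(item, path + '/'):
                    yield sub_path
            else:
                yield path

    def repack_collection(src_uuid_or_pdh, new_name, api=None):
        """Copy every file byte-for-byte into a new collection so its manifest
        references freshly packed blocks instead of small slices of large ones."""
        api = api or arvados.api('v1')
        src = arvados.collection.Collection(src_uuid_or_pdh, api_client=api)
        dst = arvados.collection.Collection(api_client=api)
        for path in iter_file_paths(src):
            with src.open(path, 'rb') as rf, dst.open(path, 'wb') as wf:
                # Stream in 1 MiB chunks so large files need not fit in memory.
                while True:
                    chunk = rf.read(1 << 20)
                    if not chunk:
                        break
                    wf.write(chunk)
        dst.save_new(name=new_name)
        return dst

    if __name__ == '__main__':
        repacked = repack_collection(sys.argv[1], sys.argv[2])
        print(repacked.portable_data_hash())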
#1

Updated by Peter Amstutz 4 days ago

  • Description updated (diff)
#2

Updated by Peter Amstutz 4 days ago

  • Subject changed from Poor performance when a collection consists mostly of small slices of many large blocks to Poor performance when a collection consists mostly of small slices of many different large blocks