Idea #3491

[Keep] Support transparent compression of blocks in Keep

Added by Peter Amstutz over 9 years ago. Updated about 4 years ago.

Status: New
Priority: Normal
Assigned To: -
Category: Keep
Target version: -
Story points: -

Description

Support automatic compression of blocks in the Keep server. Proposed design:

  • Keep server can accept PUT blocks compressed with gzip, or use gzip to compress uncompressed blocks before saving to disk. Compare compressed/uncompressed sizes to ensure that compression isn't adding unnecessary overhead.
  • On GET, Keep clients provide the "Accept-Encoding: gzip" header, and the server responds with "Content-Encoding: gzip" and spools the compressed data directly off disk.
  • Keep client decompresses the data before delivering it to the application.
Benefits:
  • Support random access into large files without needing special file formats or explicit application support.
  • Reduce disk and network usage across the board.
  • Transparent to the user.
Drawbacks:
  • Adds a "decompress and then re-compress on Keep block boundaries" step when working with a collection that's already compressed at the file level.
  • May increase latency and client overhead, because each block must be decompressed before it can be used.
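
To make the proposed flow concrete, here is a minimal sketch (in Go, since the Keep server is written in Go) of what the server side could look like. This is not actual keepstore code: the handler shape, the volumeRoot path, and the ".gz" on-disk naming convention are assumptions for illustration only.

```go
package keepsketch

import (
	"bytes"
	"compress/gzip"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"strings"
)

var volumeRoot = "/var/lib/keep" // hypothetical volume path

// storeBlock gzips an incoming block and keeps whichever form is smaller,
// so compression never adds overhead for incompressible data.
func storeBlock(hash string, data []byte) error {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return err
	}
	if err := zw.Close(); err != nil {
		return err
	}
	if buf.Len() < len(data) {
		// Store compressed, marked with a ".gz" suffix (an assumed convention).
		return os.WriteFile(filepath.Join(volumeRoot, hash+".gz"), buf.Bytes(), 0o600)
	}
	return os.WriteFile(filepath.Join(volumeRoot, hash), data, 0o600)
}

// getBlock serves a block, spooling gzip data straight off disk when the
// client advertises "Accept-Encoding: gzip", and decompressing otherwise.
func getBlock(w http.ResponseWriter, r *http.Request, hash string) {
	gzPath := filepath.Join(volumeRoot, hash+".gz")
	if f, err := os.Open(gzPath); err == nil {
		defer f.Close()
		if strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
			w.Header().Set("Content-Encoding", "gzip")
			io.Copy(w, f) // compressed bytes pass through untouched
			return
		}
		zr, err := gzip.NewReader(f)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer zr.Close()
		io.Copy(w, zr) // fall back to decompressing on the server
		return
	}
	// Block was stored uncompressed.
	http.ServeFile(w, r, filepath.Join(volumeRoot, hash))
}
```

On the client side, the Keep client would send "Accept-Encoding: gzip" and, when the response carries "Content-Encoding: gzip", wrap the body in a gzip reader before handing the data to the application.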
Actions #1

Updated by Peter Amstutz over 9 years ago

  • Subject changed from Support transparent compression of blocks in Keep to [Keep] Support transparent compression of blocks in Keep
  • Description updated (diff)
  • Category set to Keep
Actions #2

Updated by Tom Clegg over 9 years ago

  • Target version set to Deferred
Actions #3

Updated by Stanislaw Adaszewski about 4 years ago

I would let the user decide whether blocks should be compressed or raw, but this is definitely a great feature with potential for a lot of space savings; personally, I would like to have it. I would implement it slightly differently, though: use the checksum of the compressed data as the block address, so there would be no need to decompress to verify the checksum or re-compress to send. Then the only remaining piece would be for the FUSE driver to decompress blocks marked as gzip-compressed on the fly. If algorithms other than gzip were an option, there are compression schemes designed to be much faster to decompress, e.g. WKdm, which is used for memory compression on Mac OS. That would be less convenient insofar as HTTP doesn't support it as a content encoding, but it is much faster.
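
A small sketch of the addressing variant suggested above, where the locator is derived from the compressed bytes so the server can verify integrity without decompressing. The "+G" locator hint is purely hypothetical, not an existing Keep convention.

```go
package keepsketch

import (
	"bytes"
	"compress/gzip"
	"crypto/md5"
	"fmt"
)

// compressAndAddress gzips a block and derives a Keep-style locator from
// the *compressed* bytes, so integrity checks never need to decompress.
func compressAndAddress(data []byte) (locator string, compressed []byte, err error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err = zw.Write(data); err != nil {
		return "", nil, err
	}
	if err = zw.Close(); err != nil {
		return "", nil, err
	}
	compressed = buf.Bytes()
	// md5 of the stored (compressed) bytes, plus a hypothetical "+G" hint
	// telling readers the payload must be gunzipped before use.
	locator = fmt.Sprintf("%x+%d+G", md5.Sum(compressed), len(compressed))
	return locator, compressed, nil
}
```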

Actions #4

Updated by Peter Amstutz about 4 years ago

The main reason this hasn't been a priority is that many file formats, such as BAM and various compressed image formats, already have domain-specific compression. Trying to compress already-compressed files is counterproductive, since at best it is a waste of time and at worst the result is larger than if you had left it alone. It also turns out that at gigabit+ transfer speeds, involving the CPU to do compression/decompression can be a huge bottleneck compared to just sending the data uncompressed (for typical compression ratios).

Actions #5

Updated by Stanislaw Adaszewski about 4 years ago

Thank you for your reply. This makes sense. However, I recently unpacked UniRef30, for example, and it jumped from 42 GB compressed to 162 GB uncompressed. It would be neat to have compression as a user-controlled option. Some brainstorming on this could be worthwhile, as I encounter this kind of ratio pretty often.
