Story #3491

open

[Keep] Support transparent compression of blocks in Keep

Added by Peter Amstutz almost 8 years ago. Updated over 2 years ago.

Status: New
Priority: Normal
Assigned To: -
Category: Keep
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

Support automatic compression of blocks in the Keep server. Proposed design:

  • The Keep server can accept PUT blocks that are already compressed with gzip, or use gzip to compress uncompressed blocks before saving them to disk. It compares the compressed and uncompressed sizes so that compression is not applied when it only adds overhead.
  • On GET, Keep clients provide the "Accept-Encoding: gzip" header, and the server responds with "Content-Encoding: gzip" and spools the compressed data directly off disk.
  • Keep client decompresses the data before delivering it to the application.
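
The three steps above could be sketched roughly as follows. This is only an illustration and not Keep's actual code: the function names, the `min_saving` threshold, and the naive `Accept-Encoding` check (which ignores q-values) are all assumptions made for the example.

```python
import gzip


def encode_for_storage(block: bytes, min_saving: float = 0.05):
    """Return (stored_bytes, is_gzipped).

    Hypothetical policy: keep the gzip form only if it is at least
    `min_saving` (5% by default) smaller than the raw block.
    """
    compressed = gzip.compress(block)
    if len(compressed) <= len(block) * (1 - min_saving):
        return compressed, True
    return block, False


def respond_to_get(stored: bytes, is_gzipped: bool, accept_encoding: str):
    """Return (headers, body) for a GET request.

    Simplified negotiation: any mention of "gzip" in the client's
    Accept-Encoding value counts as acceptance.
    """
    client_accepts_gzip = "gzip" in accept_encoding.lower()
    if is_gzipped and client_accepts_gzip:
        # Send the compressed bytes exactly as they are stored.
        return {"Content-Encoding": "gzip"}, stored
    if is_gzipped:
        # Client did not ask for gzip, so decompress on the server side.
        return {}, gzip.decompress(stored)
    return {}, stored


def client_decode(headers: dict, body: bytes) -> bytes:
    """Client side: undo gzip before handing data to the application."""
    if headers.get("Content-Encoding") == "gzip":
        return gzip.decompress(body)
    return body
```

With this policy, highly repetitive data is stored compressed, while data that gzip cannot shrink (for example, a block from an already-compressed file) is stored unchanged.
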
Benefits:
  • Support random access into large files without needing special file formats or explicit application support.
  • Reduce disk and network usage across the board.
  • Transparent to the user.
Drawbacks:
  • Adds a "decompress and then re-compress on Keep block boundaries" step when working with a collection that's already compressed at the file level.
  • May increase latency and client overhead, because each block needs to be decompressed before it can be used.
Actions #1

Updated by Peter Amstutz almost 8 years ago

  • Subject changed from Support transparent compression of blocks in Keep to [Keep] Support transparent compression of blocks in Keep
  • Description updated (diff)
  • Category set to Keep
Actions #2

Updated by Tom Clegg almost 8 years ago

  • Target version set to Deferred
Actions #3

Updated by Stanislaw Adaszewski over 2 years ago

I would let the user decide whether blocks should be compressed or stored raw, but this is definitely a great feature with potential for a lot of space savings. As a private individual, I would like this feature. I would implement it slightly differently, though: use the checksum of the compressed data as the block address, so that there would be no need to decompress to verify the checksum and then re-compress to send. The only remaining piece would be for the FUSE driver to decompress blocks marked as gzip-compressed on the fly. If algorithms other than gzip were an option, there are compression schemes designed to be much faster to decompress, e.g. WKdm, which is used for memory compression on Mac OS. This would perhaps be less convenient insofar as HTTP doesn't support it as a content encoding, but it is much faster.
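
To make the addressing idea concrete, here is a minimal sketch. It assumes an in-memory dict standing in for the block store and an MD5 hex digest as the address; the function names are hypothetical and this is not how Keep is implemented.

```python
import gzip
import hashlib


def put_compressed(store: dict, block: bytes) -> str:
    """Compress a block and store it under the hash of the compressed bytes."""
    # mtime=0 keeps the gzip header free of a timestamp, so the same input
    # compresses to the same bytes (and therefore the same address).
    compressed = gzip.compress(block, mtime=0)
    address = hashlib.md5(compressed).hexdigest()
    store[address] = compressed
    return address


def verify(store: dict, address: str) -> bool:
    """Check integrity without decompressing: hash the stored bytes directly."""
    return hashlib.md5(store[address]).hexdigest() == address


def read_block(store: dict, address: str) -> bytes:
    """What a FUSE-style client would do: decompress on the fly when reading."""
    return gzip.decompress(store[address])
```

One consequence of this scheme is that the address depends on the compressor's exact output, so the same data compressed with a different level or library would get a different address.
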

Actions #4

Updated by Peter Amstutz over 2 years ago

The main reason this hasn't been a priority is that many file formats already have domain-specific compression, such as BAM and various compressed image formats. Trying to compress already-compressed files is counterproductive: at best it is a waste of time, and at worst the result is larger than if you had left it alone. It also turns out that at gigabit+ transfer speeds, involving the CPU to do compression/decompression can be a huge bottleneck compared to just sending the data uncompressed (for typical compression ratios).

Actions #5

Updated by Stanislaw Adaszewski over 2 years ago

Thank you for your reply. This makes sense. However, I recently unpacked UniRef30, for example, and it jumped from 42 GB compressed to 162 GB uncompressed. It would be neat to have compression as a user-controlled option. Some brainstorming on this could be worthwhile, as I encounter this kind of ratio pretty often.
