Keep data block life cycle

Unlike most storage systems, Keep uses content-addressed data blocks, so usage of back-end storage does not correspond directly to the amount of data being stored at any given moment.

If a single data block is referenced by 7 collections with desired replication 1, and by 7 other collections with desired replication 2:
  • two copies of the block are stored (the highest desired replication among the referencing collections)
  • deleting one of the 14 collections has no impact on back-end usage
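
The copy count in this example can be sketched in a few lines of Python. This is only an illustration, assuming the number of stored copies equals the highest desired replication among the collections that reference the block; `copies_to_store` is a hypothetical helper, not part of any Keep API.

```python
def copies_to_store(desired_replications):
    """Return how many copies of a block should exist.

    Assumption: a block needs as many copies as the most demanding
    collection that references it, and none if nothing references it.
    """
    return max(desired_replications, default=0)


# 7 collections with desired replication 1, 7 others with replication 2.
collections = [1] * 7 + [2] * 7
print(copies_to_store(collections))  # 2

# Deleting any single collection leaves the result unchanged.
for i in range(len(collections)):
    remaining = collections[:i] + collections[i + 1:]
    assert copies_to_store(remaining) == 2
```
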
Even when all 14 collections are deleted, the unreferenced data block cannot necessarily be deleted right away to free up storage space:
  • Recently written blocks cannot be garbage collected, because a client might still hold a reference in memory and use it to create a collection (or read the data back).
  • Blocks referenced by recently retrieved collections cannot be garbage collected, for the same reason.
  • Keep servers use a "trashed" state to accommodate eventually consistent back-end behavior (e.g., AWS S3) and to provide a safety net for recovering data that was deleted prematurely due to a bug or configuration problem.

For example, with TrashLifetime = 10d and BlobSignatureTTL = 10d, it takes at least 20d to recover the space used by a block -- starting at the last time the data was written to Keep by a client or referenced in a collection.

time (days)  keep0        keep1      client                                api/db                               comment
-----------  -----------  ---------  ------------------------------------  -----------------------------------  --------------------------------------------------
+0           write B1     write B1
+1           write B1
+2           write B1
+3                                   create collection C1 referencing B1
+4                                   trash collection C1 ("trash_at=now",
                                     which implies "delete_at=+10d")
+5                        write B1
+13          (no action)                                                                                        10d (blob signature TTL) since last write on keep0,
                                                                                                                but still referenced by C1
+14          trash B1                                                      collection C1 expires automatically  10d (blob signature TTL) since last write on keep0
+15                       trash B1                                                                              10d (blob signature TTL) since last write on keep1
+24          delete B1                                                                                          10d (trash lifetime) since trashed on keep0
+25                       delete B1                                                                             10d (trash lifetime) since trashed on keep1
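
The trash and delete days in this timeline can be reproduced with a small calculation. The following is a simplified sketch, not Keep's actual garbage collection logic. It assumes that a copy of a block may be trashed once BlobSignatureTTL has passed since the last write on that server and no collection references the block any longer, and that it is deleted TrashLifetime after being trashed. The function and variable names are hypothetical.

```python
BLOB_SIGNATURE_TTL = 10  # days, as in the example
TRASH_LIFETIME = 10      # days, as in the example


def block_schedule(last_write_day, unreferenced_day):
    """Return (trash_day, delete_day) for one copy of a block.

    last_write_day:   last day the block was written on this server
    unreferenced_day: day the last referencing collection expired
    """
    trash_day = max(last_write_day + BLOB_SIGNATURE_TTL, unreferenced_day)
    delete_day = trash_day + TRASH_LIFETIME
    return trash_day, delete_day


# C1 is trashed on day 4 with delete_at=+10d, so it expires on day 14.
c1_expires = 4 + 10

last_writes = {"keep0": 2, "keep1": 5}
for server, last_write in last_writes.items():
    trash_day, delete_day = block_schedule(last_write, c1_expires)
    print(f"{server}: trash B1 at +{trash_day}, delete B1 at +{delete_day}")
# keep0: trash B1 at +14, delete B1 at +24
# keep1: trash B1 at +15, delete B1 at +25
```

Under the same assumptions, a block written on day 0 and never referenced by a collection gives `block_schedule(0, 0) == (10, 20)`, which is the 20d minimum from the example settings.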