Project

General

Profile

Actions

Data Manager Design Doc » History » Revision 4

« Previous | Revision 4/5 (diff) | Next »
Tom Clegg, 12/30/2014 02:28 AM


Data Manager Design Doc

Misha Zatsman
Last Updated: July 23, 2014

Overview

Objective

The Data Manager periodically examines the aggregate state of keep storage. It validates that each block is sufficiently replicated, reports on user disk usage and determines which blocks should be deleted when space is needed.

The intended audience for this design doc is software engineers and bioinformaticians.

Background

The Keep Design Doc provides an overview of the Keep system.

The Data Manager also communicates directly with the API Server and indirectly with Workbench.

Alternatives

The aggregate state compiled periodically by the data manager could instead be written piecewise by the keep clients or servers, each time a block is written or read. This would add complexity and latency to the write and read process, especially when recovering from failures.

Design Goals and Tradeoffs

As the only component of the Keep system with visibility into the aggregate state, the Data Manager has several responsibilities:

  • Replication Enforcement - Detect when blocks are under or over replicated.
  • Garbage Collection - Detect which blocks are eligible for deletion and prioritize oldest blocks first.
  • Usage Reporting - Report on how much storage is used by each user.

High Level Design

The work is split between the Data Manager and Workbench.

The Data Manager is responsible for:
  • Data Aggregation - Reading the system state and producing useful reports.
  • Correction - Modifying keep server storage to fulfill replication expectations, balance usage or create free space.

To accomplish this, the Data Manager contacts the API Server and Keep Servers to learn about the aggregate state. It uses all the collected data to produce reports. Finally the Data Manager writes the reports to the API Server in the form of logs.

Workbench is responsible for:
  • Presentation - Displaying reports to users.

The Workbench reads the reports from the logs in the API Server and displays them to users.

Reports Produced

The Data Manager will produce the following reports:

  • Storage per user and project. See Usage Log Format below for more information.
    • Collection Storage of each project and user: The amount of disk space usage we report to our users.
      • This is the sum across all collections owned by the user/project and subprojects of the collection size multiplied by the requested replication rate.
      • We will split this sum into subtotals for persisted and ephemeral collections.
      • These are the only numbers reported to non-admin users.
      • These are also the only numbers that are computed per project.
    • Deduped Storage of each user: The amount of disk space this user would need if there were no other users.
      • This is the same as Collection Storage, except blocks shared between collections owned by the same user will no longer be double counted.
      • With any duplication, this number will be lower than the collection storage reported for the user thanks to content-addressed storage.
    • Weighted Storage of each user: The amount of disk space actually consumed by this user.
      • This is the same as Deduped Storage, except the cost of a block is shared between all users who stored that block and their portion is weighted by their requested replication level.
      • See Weighting Disk Usage below to understand how it's done.
      • The savings here are due to content-addressed storage, as above.
  • The disk space used for each of: persisted, ephemeral, unreferenced and cached blocks as well as the amount of disk that is free.
    • See Block Types below for an explanation of these different types.
  • A list of under-replicated blocks, their requested replication levels and which keep servers they appear on.
  • A list of over-replicated blocks, their requested replication levels and which keep servers they appear on.
  • A histogram depicting our garbage collection lifetime vs disk usage tradeoff.
    • The histogram will answer the question: If we deleted all of our non-persisted blocks older than X how much disk space would that free up.
    • See Garbage Collection Log Format below for more information.

Specifics

Interfaces

The Data Manager records its findings in log entries in the API Server. Workbench code will read those entries to display requested reports.

The following subsections describe the syntax and semantics of those log entries in detail.

Usage Log Format

On each pass through the Keep data, the Data Manager records one log entry for each user or project that owns any collections.

There will be many log entries for each user forming a history of usage over time, so if you want the current entry, make sure you read the latest one instead of an arbitrary one.

Each log entry will set the following fields:

  • object_uuid is set to the uuid of the user or project.
  • event_type is set to 'user-storage-report'.
  • The properties map has the following keys and values for all users and projects:
    • 'collection_storage_bytes': The sum of the collection byte size times requested replication level across all collections owned by the user/project, either directly or through subprojects.
  • The properties map has the following keys and values for all users (but not projects). These numbers are intended for admin users to better understand the usage of other users for capacity planning, and should invisible to users without admin privileges:
    • 'deduped_storage_bytes': The sum of the block byte size times requested replication level of all blocks that appear in collections owned by the user, either directly or through subprojects.
    • 'weighted_storage_bytes': Same as above, except each block's size is weighted by the replication level for the user.
Some definitions and gotchas:
  • Since a block may appear in multiple collections, its replication level for a given user is the max replication level requested in all the user's collections which contain the block.
  • weighted_storage_bytes summed across all users will sum to the disk space used by all blocks that are persisted (assuming that blocks are replicated as requested).
  • Computing weighted_storage_bytes is a bit tricky, see Weighting Disk Usage below to understand how it's done.
  • For another take on the different fields, see Reports Produced above.

Garbage Collection Log Format

The data in the Garbage Collection Log illustrates the tradeoff between how long we keep around blocks that have not been persisted and how much of our disk space is free.

The Data Manager records one Garbage Collection Log on each pass through the Keep data. There will be many such log entries, so if you want the current state, make sure you read the latest one instead of an arbitrary one.

Each log entry will set the following fields:
  • The event_type is set to 'block-age-free-space-histogram'.
  • The properties map has the histogram stored at the key 'histogram'

The histogram is represented as a list of pairs (mtime, disk_proportion_free).

An entry with mtime m and disk_proportion_free p means that if we deleted all non-persisted blocks with mtime older than m we would end up with p of our keep disk space free.
  • mtime is represented as seconds since the epoch.
  • disk_proportion_free is represented as a number between 0 and 1.
Some definitions and gotchas:
  • Since persisted blocks will not be deleted by definition, we will never see a value of 1 for disk_proportion_free in the histogram unless there are no blocks that are persisted.
  • We will never see a value of 0 for disk_proportion_free in the histogram unless our disk is full.
  • The mtime we use for a given block is the max mtime across all keep servers storing that block. In most cases we expect the disagreement to be minimal, but in some cases (e.g. when a block has been copied much later to increase replication) it may be significant. During actual garbage collection, block ids should be used instead of mtimes. This data is just used to provide insights for administrators.

Detailed Design

Block Types

The Keep data model uses collections to organize data. However Keep Servers are unaware of collections and instead read and write data in blocks.

A collection contains one or more blocks. A block can be a part of many collections (or none).

A collection will be in one of the following persistence states: Persisted, Ephemeral or Deleted. See Persisting Data in the Keep Design Doc for a discussion of these states.

A block can exist in one of four states based on the collections that contain it and its age:
  • A block is persisted if it belongs to at least one persisted collection.
  • A block is ephemeral if it is not persisted and belongs to at least one ephemeral collection.
  • A block is unreferenced if it is not in any persisted or ephemeral collection and less than two weeks old.
  • A block is cached if it is not in any persisted or ephemeral collection and at least two weeks old.

Ownership of a collection

  • The owner of a collection can be a user or a project.
  • The owner of a project can be a user or another project.
  • Eventually if you follow the chain of owners from a collection, you will get to a user.
  • That user will be charged for the disk space used by the collection.
  • Since a block may appear in multiple collections, it may have multiple owners.

Weighting Disk Usage

When a block is stored by multiple users (and/or projects) at different replication levels, the cost of each additional copy is shared by all users who want at least that many copies.

Let's assume that a given block is 12 Megabytes in size and is stored by four users:
  • User A has requested a replication level of 2
  • User B has requested a replication level of 7
  • User C has requested a replication level of 3
  • User D has requested a replication level of 5

So Keep will store 7 copies of the block, taking up 7 x 12 = 84 Megabytes.

Now we compute how to distribute the cost of each copy on disk:
  • The first copy is required by all four users, so they each pay a charge of 12 / 4 = 3 Megabytes.
  • The second copy is also required by all four users, so they each pay another charge of 3 Megabytes.
  • The third copy is required by 3 users (B, C & D), so they each pay a charge of 12 / 3 = 4 Megabytes.
  • The fourth copy is required by 2 users (B & D), so they each pay a charge of 12 / 2 = 6 Megabytes.
  • The fifth copy is also required by B & D, so they each pay another charge of 6 Megabytes.
  • The sixth copy is required only by B, so she pays a charge of 12 / 1 = 12 Megabytes.
  • The seventh copy is again required only by B, so she pays another charge of 12 Megabytes.
Now we sum up all the charges for the users which requested those copies:
  • A is charged for 3 + 3 = 6 Megabytes.
  • C is charged for 3 + 3 + 4 = 10 Megabytes.
  • D is charged for 3 + 3 + 4 + 6 + 6 = 22 Megabytes.
  • B is charged for 3 + 3 + 4 + 6 + 6 + 12 + 12 = 46 Megabytes.

Notice that the sum of the weighted cost to the users is the disk space used by the block: 6 + 10 + 22 + 46 = 84 Megabytes.

Code Location

  • The prototype datamanager is located in arvados/services/datamanager/experimental/datamanager.py
    • The prototype is not being actively developed
  • The production datamanager is located in arvados/services/datamanager/datamanager.go
    • The production datamanager is being actively developed by Misha Zatsman

Testing Plan

How you will verify the behavior of your system. Once the system is written, this section should be updated to reflect the current state of testing and future aspirations.

Logging

What your system will record and how.

Debugging

How users can debug interactions with your system. When designing a system it's important to think about what tools you can provide to make debugging problems easier. Sometimes it's unclear whether the problem is in your system at all, so a mechanism for isolating a particular interaction and examining it to see if your system behaved as expected is very valuable. Once a system is in use, this is a great place to put tips and recipes for debugging. If this section grows too large, the mechanisms can be summarized here and individual tips can be moved to another document.

Caveats

If a user has stored some data, the data manager will periodically produce usage reports for that user. If the user later deletes all of their data, the data manager will no longer produce usage reports for that user. Therefore loading the latest usage report for the user will actually report, incorrectly, that the user is using some storage capacity.

Security Concerns

This section should describe possible threats (denial of service, malicious requests, etc) and what, if anything, is being done to protect against them. Be sure to list concerns for which you don't have a solution or you believe don't need a solution. Security concerns that we don't need to worry about also belong here (e.g. we don't need to worry about denial of service attacks for this system because it only receives requests from the api server which already has DOS attack protections).

Open Questions and Risks

This section should describe design questions that have not been decided yet, research that needs to be done and potential risks that could make make this system less effective or more difficult to implement.

Some examples are: Should we communicate using TCP or UDP? How often do we expect our users to interrupt running jobs? This relies on an undocumented third-party API which may be turned off at any point.

For each question you should include any relevant information you know. For risks you should include estimates of likelihood, cost if they occur and ideas for possible workarounds.

Work Estimates

Split the work into milestones that can be delivered, put them in the order that you think they should be done, and estimate roughly how much time you expect it each milestone to take. Ideally each milestone will take one week or less.

Future Work

Additional Reports

The following reports could be useful for system administrators, both for debugging issues outside of data manager and to gain confidence that the data manager is seeing the whole system.

  • Total Storage Size (overall and per keep server) and space free
  • Did we get everything we needed? e.g. were all keep servers reachable and all collections retrieved.
  • Start Time, End Time and Time Taken for Data Manager Processing and each stage of it (i.e. collection scanning, keep server scanning)
  • Number of collections read
  • Number of distinct blocks referenced by collections, distribution of #references per block
  • Number of blocks, and number & total size of distinct blocks, per Keep server
  • Number & total size of blocks with replication lower than desired
  • Number & total size of blocks with replication higher than desired
  • Oldest and youngest collection seen (both by modified date and creation date)
  • Distribution of collection sizes and replication levels

Revision History

The table below should record the major changes to this document. You don't need to add an entry for typo fixes, other small changes or changes before finishing the initial draft.

Date Revisions Made Author Reviewed By
June 25, 2014 Initial Draft Misha Zatsman ----
July 23, 2014 Added Reports Produced, Block Types & updated other sections Misha Zatsman Tom Clegg, Ward Vandewege, Tim Pierce

Updated by Tom Clegg about 9 years ago · 4 revisions