Project

General

Profile

Data Manager Design Doc » History » Revision 4

Revision 3 (Misha Zatsman, 12/24/2014 01:23 AM) → Revision 4/5 (Tom Clegg, 12/30/2014 02:28 AM)

{{>toc}} 

 h1=. Data Manager Design Doc 

 p>. *"Misha Zatsman":mailto:misha@curoverse.com 
 Last Updated: July 23, 2014* 

 h2. Overview 

 h3. Objective 

 The Data Manager periodically examines the aggregate state of keep storage. It validates that each block is sufficiently replicated, reports on user disk usage and determines which blocks should be deleted when space is needed. 

 The intended audience for this design doc is software engineers and bioinformaticians. 

 h3. Background 

 The [[Keep Design Doc]] provides an overview of the Keep system. 

 The Data Manager also communicates directly with the [[REST API Server|API Server]] and indirectly with [[Workbench]]. 

 h3. Alternatives 

 The aggregate state compiled periodically by the data manager could instead be written piecewise by the keep clients or servers, each time a block is written or read. This would add complexity and latency to the write and read process, especially when recovering from failures. 

 h3. Design Goals and Tradeoffs 

 As the only component of the Keep system with visibility into the aggregate state, the Data Manager has several responsibilities: 

 * *Replication Enforcement* - Detect when blocks are under or over replicated.  
 * *Garbage Collection* - Detect which blocks are eligible for deletion and prioritize oldest blocks first. 
 * *Usage Reporting* - Report on how much storage is used by each user. 

 h3. High Level Design 

 !https://docs.google.com/drawings/d/17W2DOxBPVtaTH0J364x3BXDcynRsJWXMx6fuiXW770M/pub?w=723&h=403!:https://docs.google.com/a/curoverse.com/drawings/d/17W2DOxBPVtaTH0J364x3BXDcynRsJWXMx6fuiXW770M/edit?usp=sharing 

 The work is split between the Data Manager and Workbench. 

 The Data Manager is responsible for: 
 * *Data Aggregation* - Reading the system state and producing useful reports. 
 * *Correction* - Modifying keep server storage to fulfill replication expectations, balance usage or create free space. 

 To accomplish this, the Data Manager contacts the API Server and Keep Servers to learn about the aggregate state. It uses all the collected data to produce reports. Finally the Data Manager writes the reports to the API Server in the form of logs. 

 Workbench is responsible for: 
 * *Presentation* - Displaying reports to users. 

 The Workbench reads the reports from the logs in the API Server and displays them to users. 

 h3. Reports Produced 

 The Data Manager will produce the following reports: 

 * Storage per user and project. See [[Data Manager Design Doc#Usage Log Format|Usage Log Format]] below for more information. 
 ** *Collection Storage* of each project and user: The amount of disk space usage we report to our users. 
 *** This is the sum across all collections owned by the user/project and subprojects of the collection size multiplied by the requested replication rate. 
 *** We will split this sum into subtotals for persisted and ephemeral collections. 
 *** These are the only numbers reported to non-admin users. 
 *** These are also the only numbers that are computed per project. 
 ** *Deduped Storage* of each user: The amount of disk space this user would need if there were no other users. 
 *** This is the same as Collection Storage, except blocks shared between collections owned by the same user will no longer be double counted. 
 *** With any duplication, this number will be lower than the collection storage reported for the user thanks to content-addressed storage. 
 ** *Weighted Storage* of each user: The amount of disk space actually consumed by this user. 
 *** This is the same as Deduped Storage, except the cost of a block is shared between all users who stored that block and their portion is weighted by their requested replication level. 
 *** See [[Data Manager Design Doc#Weighting Disk Usage|Weighting Disk Usage]] below to understand how it's done. 
 *** The savings here are due to content-addressed storage, as above. 
 * The disk space used for each of: persisted, ephemeral, unreferenced and cached blocks as well as the amount of disk that is free. 
 ** See [[Data Manager Design Doc#Block Types|Block Types]] below for an explanation of these different types. 
 * A list of under-replicated blocks, their requested replication levels and which keep servers they appear on. 
 * A list of over-replicated blocks, their requested replication levels and which keep servers they appear on. 
 * A histogram depicting our garbage collection lifetime vs disk usage tradeoff. 
 ** The histogram will answer the question: If we deleted all of our non-persisted blocks older than X how much disk space would that free up. 
 ** See [[Data Manager Design Doc#Garbage Collection Log Format|Garbage Collection Log Format]] below for more information. 

 h2. Specifics 

 h3. Interfaces 

 The Data Manager records its findings in log entries in the API Server. Workbench code will read those entries to display requested reports. 

 The following subsections describe the syntax and semantics of those log entries in detail. 

 h4. Usage Log Format 

 On each pass through the Keep data, the Data Manager records one log entry for each user or project that owns any collections. 

 There will be many log entries for each user forming a history of usage over time, so if you want the current entry, make sure you read the latest one instead of an arbitrary one. 

 Each log entry will set the following fields: 

 * <code>object_uuid</code> is set to the <code>uuid</code> of the user or project. 
 * <code>event_type</code> is set to '<code>user-storage-report</code>'. 
 * The <code>properties</code> map has the following keys and values for all users and projects: 
 ** <code>'collection_storage_bytes'</code>: The sum of the collection byte size times requested replication level across all collections owned by the user/project, either directly or through subprojects. 
 * The <code>properties</code> map has the following keys and values for all users (but not projects). These numbers are intended for admin users to better understand the usage of other users for capacity planning, and should invisible to users without admin privileges: 
 ** <code>'deduped_storage_bytes'</code>: The sum of the block byte size times requested replication level of all blocks that appear in collections owned by the user, either directly or through subprojects. 
 ** <code>'weighted_storage_bytes'</code>: Same as above, except each block's size is weighted by the replication level for the user. 

 *Some definitions and gotchas:* 
 * Since a block may appear in multiple collections, its replication level for a given user is the max replication level requested in all the user's collections which contain the block. 
 * <code>weighted_storage_bytes</code> summed across all users will sum to the disk space used by all blocks that are persisted (assuming that blocks are replicated as requested). 
 * Computing <code>weighted_storage_bytes</code> is a bit tricky, see [[Data Manager Design Doc#Weighting Disk Usage|Weighting Disk Usage]] below to understand how it's done. 
 * For another take on the different fields, see    [[Data Manager Design Doc#Reports Produced|Reports Produced]] above. 

 h4. Garbage Collection Log Format 

 The data in the Garbage Collection Log illustrates the tradeoff between how long we keep around blocks that have not been persisted and how much of our disk space is free. 

 The Data Manager records one Garbage Collection Log on each pass through the Keep data. There will be many such log entries, so if you want the current state, make sure you read the latest one instead of an arbitrary one. 

 Each log entry will set the following fields: 
 * The <code>event_type</code> is set to <code>'block-age-free-space-histogram'</code>. 
 * The <code>properties</code> map has the histogram stored at the key <code>'histogram'</code> 

 The histogram is represented as a list of pairs (<code>mtime</code>, <code>disk_proportion_free</code>). 

 An entry with <code>mtime m</code> and <code>disk_proportion_free p</code> means that if we deleted all non-persisted blocks with <code>mtime</code> older than <code>m</code> we would end up with <code>p</code> of our keep disk space free. 
 * <code>mtime</code> is represented as seconds since the epoch. 
 * <code>disk_proportion_free</code> is represented as a number between 0 and 1. 

 *Some definitions and gotchas:* 
 * Since persisted blocks will not be deleted by definition, we will never see a value of 1 for <code>disk_proportion_free</code> in the histogram unless there are no blocks that are persisted. 
 * We will never see a value of 0 for <code>disk_proportion_free</code> in the histogram unless our disk is full. 
 * The mtime we use for a given block is the max mtime across all keep servers storing that block. In most cases we expect the disagreement to be minimal, but in some cases (e.g. when a block has been copied much later to increase replication) it may be significant. During actual garbage collection, block ids should be used instead of mtimes. This data is just used to provide insights for administrators. 

 h3. Detailed Design 

 h4. Block Types 

 The Keep data model uses collections to organize data. However Keep Servers are unaware of collections and instead read and write data in blocks. 

 A collection contains one or more blocks. A block can be a part of many collections (or none). 

 A collection will be in one of the following persistence states: Persisted, Ephemeral or Deleted. See [[Keep Design Doc#Persisting Data|Persisting Data]] in the Keep Design Doc for a discussion of these states. 

 A block can exist in one of four states based on the collections that contain it and its age: 
 * A block is *persisted* if it belongs to at least one persisted collection. 
 * A block is *ephemeral* if it is not persisted and belongs to at least one ephemeral collection. 
 * A block is *unreferenced* if it is not in any persisted or ephemeral collection and less than two weeks old. 
 * A block is *cached* if it is not in any persisted or ephemeral collection and at least two weeks old. 

 h4. Ownership of a collection 

 * The owner of a collection can be a user or a project. 
 * The owner of a project can be a user or another project. 
 * Eventually if you follow the chain of owners from a collection, you will get to a user. 
 * That user will be charged for the disk space used by the collection. 
 * Since a block may appear in multiple collections, it may have multiple owners. 

 h4. Weighting Disk Usage 

 When a block is stored by multiple users (and/or projects) at different replication levels, the cost of each additional copy is shared by all users who want at least that many copies. 

 Let's assume that a given block is 12 Megabytes in size and is stored by four users: 
 * User A has requested a replication level of 2 
 * User B has requested a replication level of 7 
 * User C has requested a replication level of 3 
 * User D has requested a replication level of 5 

 So Keep will store 7 copies of the block, taking up <code>7 x 12 = 84</code> Megabytes. 

 Now we compute how to distribute the cost of each copy on disk: 
 * The first copy is required by all four users, so they each pay a charge of <code>12 / 4 = 3</code> Megabytes. 
 * The second copy is also required by all four users, so they each pay another charge of 3 Megabytes. 
 * The third copy is required by 3 users (B, C & D), so they each pay a charge of <code>12 / 3 = 4</code> Megabytes. 
 * The fourth copy is required by 2 users (B & D), so they each pay a charge of <code>12 / 2 = 6</code> Megabytes. 
 * The fifth copy is also required by B & D, so they each pay another charge of 6 Megabytes. 
 * The sixth copy is required only by B, so she pays a charge of <code>12 / 1 = 12</code> Megabytes. 
 * The seventh copy is again required only by B, so she pays another charge of 12 Megabytes. 

 Now we sum up all the charges for the users which requested those copies: 
 * A is charged for <code>3 + 3 = 6</code> Megabytes. 
 * C is charged for <code>3 + 3 + 4 = 10</code> Megabytes. 
 * D is charged for <code>3 + 3 + 4 + 6 + 6 = 22</code> Megabytes. 
 * B is charged for <code>3 + 3 + 4 + 6 + 6 + 12 + 12 = 46</code> Megabytes. 

 Notice that the sum of the weighted cost to the users is the disk space used by the block: <code>6 + 10 + 22 + 46 = 84</code> Megabytes. 

 h3. Code Location 

 * The prototype datamanager is located in <code>arvados/services/datamanager/experimental/datamanager.py</code> 
 ** The prototype is not being actively developed 
 * The production datamanager is located in <code>arvados/services/datamanager/datamanager.go</code> 
 ** The production datamanager is being actively developed by Misha Zatsman 

 h3. Testing Plan 

 _How you will verify the behavior of your system. Once the system is written, this section should be updated to reflect the current state of testing and future aspirations._ 

 h3. Logging 

 _What your system will record and how._ 

 h3. Debugging 

 _How users can debug interactions with your system. When designing a system it's important to think about what tools you can provide to make debugging problems easier. Sometimes it's unclear whether the problem is in your system at all, so a mechanism for isolating a particular interaction and examining it to see if your system behaved as expected is very valuable. Once a system is in use, this is a great place to put tips and recipes for debugging. If this section grows too large, the mechanisms can be summarized here and individual tips can be moved to another document._ 

 h3. Caveats 

 If a user has stored some data, the data manager will periodically produce usage reports for that user. If the user later deletes all of their data, the data manager will no longer produce usage reports for that user. Therefore loading the latest usage report for the user will actually report, incorrectly, that the user is using some storage capacity. 

 h3. Security Concerns 

 _This section should describe possible threats (denial of service, malicious requests, etc) and what, if anything, is being done to protect against them. Be sure to list concerns for which you don't have a solution or you believe don't need a solution. Security concerns that we don't need to worry about also belong here (e.g. we don't need to worry about denial of service attacks for this system because it only receives requests from the api server which already has DOS attack protections)._ 

 h3. Open Questions and Risks 

 _This section should describe design questions that have not been decided yet, research that needs to be done and potential risks that could make make this system less effective or more difficult to implement._ 

 _Some examples are: Should we communicate using TCP or UDP? How often do we expect our users to interrupt running jobs? This relies on an undocumented third-party API which may be turned off at any point._ 

 _For each question you should include any relevant information you know. For risks you should include estimates of likelihood, cost if they occur and ideas for possible workarounds._ 

 h3. Work Estimates 

 _Split the work into milestones that can be delivered, put them in the order that you think they should be done, and estimate roughly how much time you expect it each milestone to take. Ideally each milestone will take one week or less._ 

 h3. Future Work 

 

 h4. Additional Reports 

 The following reports could be useful for system administrators, both for debugging issues outside of data manager and to gain confidence that the data manager is seeing the whole system. 

 * Total Storage Size (overall and per keep server) and space free 
 * Did we get everything we needed? e.g. were all keep servers reachable and all collections retrieved. 
 * Start Time, End Time and Time Taken for Data Manager Processing and each stage of it (i.e. collection scanning, keep server scanning) scanning). 
 * Number of collections read 
 * Number of distinct blocks referenced by collections, distribution of #references per block 
 * Number of blocks, and number & total size of distinct blocks, per Keep server 
 * Number & total size of blocks with replication lower than desired 
 * Number & total size of blocks with replication higher than desired 
 * Oldest and youngest collection seen (both by modified date and creation date) 
 * Distribution of collection sizes and replication levels 
 * Number of collections read 
 * Total Storage Size (overall and per keep server) and space free 
 * Did we get everything we needed? e.g. were all keep servers reachable and all collections retrieved. 


 h3. Revision History 

 _The table below should record the major changes to this document. You don't need to add an entry for typo fixes, other small changes or changes before finishing the initial draft._ 

 |_.Date              |_.Revisions Made |_.Author              |_.Reviewed By       | 
 | June 25, 2014 | Initial Draft           | Misha Zatsman |=. ----                | 
 | July 23, 2014    | Added Reports Produced, Block Types & updated other sections | Misha Zatsman | Tom Clegg, Ward Vandewege, Tim Pierce |