Project

General

Profile

Data Manager Design Doc » History » Version 3

Misha Zatsman, 12/24/2014 01:23 AM

1 1 Misha Zatsman
{{>toc}}
2
3
h1=. Data Manager Design Doc
4
5
p>. *"Misha Zatsman":mailto:misha@curoverse.com
6
Last Updated: July 23, 2014*
7
8
h2. Overview
9
10
h3. Objective
11
12
The Data Manager periodically examines the aggregate state of keep storage. It validates that each block is sufficiently replicated, reports on user disk usage and determines which blocks should be deleted when space is needed.
13
14
The intended audience for this design doc is software engineers and bioinformaticians.
15
16
h3. Background
17
18
The [[Keep Design Doc]] provides an overview of the Keep system.
19
20
The Data Manager also communicates directly with the [[REST API Server|API Server]] and indirectly with [[Workbench]].
21
22
h3. Alternatives
23
24
The aggregate state compiled periodically by the data manager could instead be written piecewise by the keep clients or servers, each time a block is written or read. This would add complexity and latency to the write and read process, especially when recovering from failures.
25
26
h3. Design Goals and Tradeoffs
27
28
As the only component of the Keep system with visibility into the aggregate state, the Data Manager has several responsibilities:
29
30
* *Replication Enforcement* - Detect when blocks are under or over replicated. 
31
* *Garbage Collection* - Detect which blocks are eligible for deletion and prioritize oldest blocks first.
32
* *Usage Reporting* - Report on how much storage is used by each user.
33
34
h3. High Level Design
35
36
!https://docs.google.com/drawings/d/17W2DOxBPVtaTH0J364x3BXDcynRsJWXMx6fuiXW770M/pub?w=723&h=403!:https://docs.google.com/a/curoverse.com/drawings/d/17W2DOxBPVtaTH0J364x3BXDcynRsJWXMx6fuiXW770M/edit?usp=sharing
37
38
The work is split between the Data Manager and Workbench.
39
40
The Data Manager is responsible for:
41
* *Data Aggregation* - Reading the system state and producing useful reports.
42
* *Correction* - Modifying keep server storage to fulfill replication expectations, balance usage or create free space.
43
44
To accomplish this, the Data Manager contacts the API Server and Keep Servers to learn about the aggregate state. It uses all the collected data to produce reports. Finally the Data Manager writes the reports to the API Server in the form of logs.
45
46
Workbench is responsible for:
47
* *Presentation* - Displaying reports to users.
48
49
The Workbench reads the reports from the logs in the API Server and displays them to users.
50
51
h3. Reports Produced
52
53
The Data Manager will produce the following reports:
54
55
* Storage per user and project. See [[Data Manager Design Doc#Usage Log Format|Usage Log Format]] below for more information.
56
** *Collection Storage* of each project and user: The amount of disk space usage we report to our users.
57
*** This is the sum across all collections owned by the user/project and subprojects of the collection size multiplied by the requested replication rate.
58
*** We will split this sum into subtotals for persisted and ephemeral collections.
59
*** These are the only numbers reported to non-admin users.
60
*** These are also the only numbers that are computed per project.
61
** *Deduped Storage* of each user: The amount of disk space this user would need if there were no other users.
62
*** This is the same as Collection Storage, except blocks shared between collections owned by the same user will no longer be double counted.
63
*** With any duplication, this number will be lower than the collection storage reported for the user thanks to content-addressed storage.
64
** *Weighted Storage* of each user: The amount of disk space actually consumed by this user.
65
*** This is the same as Deduped Storage, except the cost of a block is shared between all users who stored that block and their portion is weighted by their requested replication level.
66
*** See [[Data Manager Design Doc#Weighting Disk Usage|Weighting Disk Usage]] below to understand how it's done.
67
*** The savings here are due to content-addressed storage, as above.
68
* The disk space used for each of: persisted, ephemeral, unreferenced and cached blocks as well as the amount of disk that is free.
69
** See [[Data Manager Design Doc#Block Types|Block Types]] below for an explanation of these different types.
70
* A list of under-replicated blocks, their requested replication levels and which keep servers they appear on.
71
* A list of over-replicated blocks, their requested replication levels and which keep servers they appear on.
72
* A histogram depicting our garbage collection lifetime vs disk usage tradeoff.
73
** The histogram will answer the question: If we deleted all of our non-persisted blocks older than X how much disk space would that free up.
74
** See [[Data Manager Design Doc#Garbage Collection Log Format|Garbage Collection Log Format]] below for more information.
75
76
h2. Specifics
77
78
h3. Interfaces
79
80
The Data Manager records its findings in log entries in the API Server. Workbench code will read those entries to display requested reports.
81
82
The following subsections describe the syntax and semantics of those log entries in detail.
83
84
h4. Usage Log Format
85
86
On each pass through the Keep data, the Data Manager records one log entry for each user or project that owns any collections.
87
88
There will be many log entries for each user forming a history of usage over time, so if you want the current entry, make sure you read the latest one instead of an arbitrary one.
89
90
Each log entry will set the following fields:
91
92
* <code>object_uuid</code> is set to the <code>uuid</code> of the user or project.
93
* <code>event_type</code> is set to '<code>user-storage-report</code>'.
94
* The <code>properties</code> map has the following keys and values for all users and projects:
95
** <code>'collection_storage_bytes'</code>: The sum of the collection byte size times requested replication level across all collections owned by the user/project, either directly or through subprojects.
96
* The <code>properties</code> map has the following keys and values for all users (but not projects). These numbers are intended for admin users to better understand the usage of other users for capacity planning, and should invisible to users without admin privileges:
97
** <code>'deduped_storage_bytes'</code>: The sum of the block byte size times requested replication level of all blocks that appear in collections owned by the user, either directly or through subprojects.
98
** <code>'weighted_storage_bytes'</code>: Same as above, except each block's size is weighted by the replication level for the user.
99
100
*Some definitions and gotchas:*
101
* Since a block may appear in multiple collections, its replication level for a given user is the max replication level requested in all the user's collections which contain the block.
102
* <code>weighted_storage_bytes</code> summed across all users will sum to the disk space used by all blocks that are persisted (assuming that blocks are replicated as requested).
103
* Computing <code>weighted_storage_bytes</code> is a bit tricky, see [[Data Manager Design Doc#Weighting Disk Usage|Weighting Disk Usage]] below to understand how it's done.
104
* For another take on the different fields, see  [[Data Manager Design Doc#Reports Produced|Reports Produced]] above.
105
106
h4. Garbage Collection Log Format
107
108
The data in the Garbage Collection Log illustrates the tradeoff between how long we keep around blocks that have not been persisted and how much of our disk space is free.
109
110
The Data Manager records one Garbage Collection Log on each pass through the Keep data. There will be many such log entries, so if you want the current state, make sure you read the latest one instead of an arbitrary one.
111
112
Each log entry will set the following fields:
113
* The <code>event_type</code> is set to <code>'block-age-free-space-histogram'</code>.
114
* The <code>properties</code> map has the histogram stored at the key <code>'histogram'</code>
115
116
The histogram is represented as a list of pairs (<code>mtime</code>, <code>disk_proportion_free</code>).
117
118
An entry with <code>mtime m</code> and <code>disk_proportion_free p</code> means that if we deleted all non-persisted blocks with <code>mtime</code> older than <code>m</code> we would end up with <code>p</code> of our keep disk space free.
119
* <code>mtime</code> is represented as seconds since the epoch.
120
* <code>disk_proportion_free</code> is represented as a number between 0 and 1.
121
122
*Some definitions and gotchas:*
123
* Since persisted blocks will not be deleted by definition, we will never see a value of 1 for <code>disk_proportion_free</code> in the histogram unless there are no blocks that are persisted.
124
* We will never see a value of 0 for <code>disk_proportion_free</code> in the histogram unless our disk is full.
125
* The mtime we use for a given block is the max mtime across all keep servers storing that block. In most cases we expect the disagreement to be minimal, but in some cases (e.g. when a block has been copied much later to increase replication) it may be significant. During actual garbage collection, block ids should be used instead of mtimes. This data is just used to provide insights for administrators.
126
127
h3. Detailed Design
128
129
h4. Block Types
130
131
The Keep data model uses collections to organize data. However Keep Servers are unaware of collections and instead read and write data in blocks.
132
133
A collection contains one or more blocks. A block can be a part of many collections (or none).
134
135
A collection will be in one of the following persistence states: Persisted, Ephemeral or Deleted. See [[Keep Design Doc#Persisting Data|Persisting Data]] in the Keep Design Doc for a discussion of these states.
136
137
A block can exist in one of four states based on the collections that contain it and its age:
138
* A block is *persisted* if it belongs to at least one persisted collection.
139
* A block is *ephemeral* if it is not persisted and belongs to at least one ephemeral collection.
140
* A block is *unreferenced* if it is not in any persisted or ephemeral collection and less than two weeks old.
141
* A block is *cached* if it is not in any persisted or ephemeral collection and at least two weeks old.
142
143
h4. Ownership of a collection
144
145
* The owner of a collection can be a user or a project.
146
* The owner of a project can be a user or another project.
147
* Eventually if you follow the chain of owners from a collection, you will get to a user.
148
* That user will be charged for the disk space used by the collection.
149
* Since a block may appear in multiple collections, it may have multiple owners.
150
151
h4. Weighting Disk Usage
152
153
When a block is stored by multiple users (and/or projects) at different replication levels, the cost of each additional copy is shared by all users who want at least that many copies.
154
155
Let's assume that a given block is 12 Megabytes in size and is stored by four users:
156
* User A has requested a replication level of 2
157
* User B has requested a replication level of 7
158
* User C has requested a replication level of 3
159
* User D has requested a replication level of 5
160
161
So Keep will store 7 copies of the block, taking up <code>7 x 12 = 84</code> Megabytes.
162
163
Now we compute how to distribute the cost of each copy on disk:
164
* The first copy is required by all four users, so they each pay a charge of <code>12 / 4 = 3</code> Megabytes.
165
* The second copy is also required by all four users, so they each pay another charge of 3 Megabytes.
166
* The third copy is required by 3 users (B, C & D), so they each pay a charge of <code>12 / 3 = 4</code> Megabytes.
167
* The fourth copy is required by 2 users (B & D), so they each pay a charge of <code>12 / 2 = 6</code> Megabytes.
168
* The fifth copy is also required by B & D, so they each pay another charge of 6 Megabytes.
169
* The sixth copy is required only by B, so she pays a charge of <code>12 / 1 = 12</code> Megabytes.
170
* The seventh copy is again required only by B, so she pays another charge of 12 Megabytes.
171
172
Now we sum up all the charges for the users which requested those copies:
173
* A is charged for <code>3 + 3 = 6</code> Megabytes.
174
* C is charged for <code>3 + 3 + 4 = 10</code> Megabytes.
175
* D is charged for <code>3 + 3 + 4 + 6 + 6 = 22</code> Megabytes.
176
* B is charged for <code>3 + 3 + 4 + 6 + 6 + 12 + 12 = 46</code> Megabytes.
177
178
Notice that the sum of the weighted cost to the users is the disk space used by the block: <code>6 + 10 + 22 + 46 = 84</code> Megabytes.
179
180
h3. Code Location
181
182 2 Misha Zatsman
* The prototype datamanager is located in <code>arvados/services/datamanager/experimental/datamanager.py</code>
183
** The prototype is not being actively developed
184
* The production datamanager is located in <code>arvados/services/datamanager/datamanager.go</code>
185
** The production datamanager is being actively developed by Misha Zatsman
186 1 Misha Zatsman
187
h3. Testing Plan
188
189
_How you will verify the behavior of your system. Once the system is written, this section should be updated to reflect the current state of testing and future aspirations._
190
191
h3. Logging
192
193
_What your system will record and how._
194
195
h3. Debugging
196
197
_How users can debug interactions with your system. When designing a system it's important to think about what tools you can provide to make debugging problems easier. Sometimes it's unclear whether the problem is in your system at all, so a mechanism for isolating a particular interaction and examining it to see if your system behaved as expected is very valuable. Once a system is in use, this is a great place to put tips and recipes for debugging. If this section grows too large, the mechanisms can be summarized here and individual tips can be moved to another document._
198
199
h3. Caveats
200
201
If a user has stored some data, the data manager will periodically produce usage reports for that user. If the user later deletes all of their data, the data manager will no longer produce usage reports for that user. Therefore loading the latest usage report for the user will actually report, incorrectly, that the user is using some storage capacity.
202
203
h3. Security Concerns
204
205
_This section should describe possible threats (denial of service, malicious requests, etc) and what, if anything, is being done to protect against them. Be sure to list concerns for which you don't have a solution or you believe don't need a solution. Security concerns that we don't need to worry about also belong here (e.g. we don't need to worry about denial of service attacks for this system because it only receives requests from the api server which already has DOS attack protections)._
206
207
h3. Open Questions and Risks
208
209
_This section should describe design questions that have not been decided yet, research that needs to be done and potential risks that could make make this system less effective or more difficult to implement._
210
211
_Some examples are: Should we communicate using TCP or UDP? How often do we expect our users to interrupt running jobs? This relies on an undocumented third-party API which may be turned off at any point._
212
213
_For each question you should include any relevant information you know. For risks you should include estimates of likelihood, cost if they occur and ideas for possible workarounds._
214
215
h3. Work Estimates
216
217
_Split the work into milestones that can be delivered, put them in the order that you think they should be done, and estimate roughly how much time you expect it each milestone to take. Ideally each milestone will take one week or less._
218
219
h3. Future Work
220
221 3 Misha Zatsman
h4. Additional Reports
222
223
The following reports could be useful for system administrators, both for debugging issues outside of data manager and to gain confidence that the data manager is seeing the whole system.
224
225
* Start Time, End Time and Time Taken for Data Manager Processing and each stage of it (i.e. collection scanning, keep server scanning).
226
* Oldest and youngest collection seen (both by modified date and creation date)
227
* Distribution of collection sizes and replication levels
228
* Number of collections read
229
* Total Storage Size (overall and per keep server) and space free
230
* Did we get everything we needed? e.g. were all keep servers reachable and all collections retrieved.
231
232 1 Misha Zatsman
233
h3. Revision History
234
235
_The table below should record the major changes to this document. You don't need to add an entry for typo fixes, other small changes or changes before finishing the initial draft._
236
237
|_.Date            |_.Revisions Made |_.Author            |_.Reviewed By     |
238
| June 25, 2014 | Initial Draft         | Misha Zatsman |=. ----              |
239
| July 23, 2014  | Added Reports Produced, Block Types & updated other sections | Misha Zatsman | Tom Clegg, Ward Vandewege, Tim Pierce |