Idea #3408
closed[Keep] Implement Production Data Manager
Added by Misha Zatsman over 10 years ago. Updated over 9 years ago.
Description
Implement the Data Manager as documented at https://arvados.org/projects/orvos-private/wiki/Data_Manager_Design_Doc
Updated by Ward Vandewege over 10 years ago
- Target version set to Arvados Future Sprints
Updated by Ward Vandewege over 10 years ago
- Target version changed from Arvados Future Sprints to 2014-08-27 Sprint
Updated by Tom Clegg over 10 years ago
- Subject changed from Implement Production Data Manager to [Keep] Implement Production Data Manager
Updated by Ward Vandewege over 10 years ago
- Target version changed from 2014-08-27 Sprint to Arvados Future Sprints
Updated by Ward Vandewege over 10 years ago
- Target version changed from Arvados Future Sprints to 2014-09-17 sprint
Updated by Ward Vandewege over 10 years ago
- Target version changed from 2014-09-17 sprint to 2014-10-08 sprint
Updated by Ward Vandewege about 10 years ago
- Status changed from New to In Progress
Updated by Ward Vandewege about 10 years ago
- Target version changed from 2014-10-08 sprint to 2014-10-29 sprint
Updated by Tom Clegg about 10 years ago
- Target version changed from 2014-10-29 sprint to 2014-11-19 sprint
Updated by Ward Vandewege about 10 years ago
- Target version changed from 2014-11-19 sprint to 2014-12-10 sprint
Updated by Ward Vandewege about 10 years ago
- Target version changed from 2014-12-10 sprint to 2015-01-07 sprint
Updated by Tom Clegg almost 10 years ago
- Target version changed from 2015-01-07 sprint to 2015-01-28 Sprint
Updated by Peter Amstutz almost 10 years ago
I want to start by saying this is really great, we have been looking forward to having DataManager for a long time, so don't take my nitpicky comments as a criticism of all your hard work!
- In main()
keepServerInfo := keep.GetKeepServersAndSummarize( keep.GetKeepServersParams{Client: arv, Logger: arvLogger, Limit: 1000}) readCollections := <-collectionChannel // Make compiler happy. _ = readCollections _ = keepServerInfo
You can make compiler happy by just taking out the unreferenced variables:keep.GetKeepServersAndSummarize( keep.GetKeepServersParams{Client: arv, Logger: arvLogger, Limit: 1000}) <-collectionChannel
- For consistency with other SDKs, consider changing the term "ManifestLine" to "ManifestStream"
LineIter()/BlockIterWithDuplicates()
are using unbuffered channels, so there isn't much opportunity for concurrency.- In
GetCollections()
, you could create a goroutine that issues the API query, and sends results over a channel to a 2nd goroutine that processes the result. This would allow for concurrent download and processing of collections. NumberCollectionsAvailable()
is not specific to collections (except for the first parameter toclient.List()
) and could be moved into the Go SDK as a utility function e.g.NumberItemsAvailable(client, resource)
- Instead of using "accessible",
GetKeepServers
should use client.List("keep_services") and filter onservice_type = disk
- It looks like
GetKeepServers
will fail if there are more than the default limit of keep_services? I see there is a TODO there. - It was able to compile datamanager on the first try with no errors. Awesome! When I ran it, it sat there for a couple of seconds, then errored out:
$ bin/datamanager 2015/01/21 14:19:21 Received 2 of 2 available keep server addresses. 2015/01/21 14:19:21 Received keep services list: {2 [petere1:25108 petere1:25107]} 2015/01/21 14:19:21 Got Server Addresses: {true [petere1:25108 petere1:25107] map[petere1:25107:1 petere1:25108:0] map[] map[] map[]} 2015/01/21 14:19:21 About to fetch keep server contents from http://petere1:25108/index Usage of bin/datamanager: -data-manager-token-file="": File with the API token we should use to contact keep servers. -heap-profile="": File to write the heap profiles to. Leave blank to skip profiling. -log-event-type="experimental-data-manager-report": event_type to use in our arvados log entries. Set to empty to turn off logging -log-frequency-seconds=20: How frequently we'll write log entries in seconds. 2015/01/21 14:19:21 About to fetch keep server contents from http://petere1:25107/index 2015/01/21 14:19:21 Data Manager Token needed, but data manager token file not specified.
This actually starts the goroutine to access the API server, then fails a few moments later because it is missing the token required to talk to the Keep servers. It should probably check flags before doing anything else. - Consider identifying keep services by uuid instead of host:port
- "dupicates_seen" misspelled should be "duplicates_seen"
- the keep servers report should include number of blocks on the service, disk space in use, total disk space.
- Keep servers report should have a per-volume (i.e. per disk) breakdown
- This could be as simple as just adding the contents of /status.json reported by keepstore to the keep services report
-
parseBlockInfoFromIndexLine()
:blockInfo.Mtime, err = strconv.Atoi(tokens[1])
strconv.Atoi()
returns an int which is can be either 32 or 64 bits. To guarantee there is no y2038 problem, it should use strconv.ParseInt() which returns int64. - Unlike the shared secret signing token, the data manager token gets sent around and may have opportunities to leak. It would be more secure to take the shared secret and hash it together with each keep service's UUID, and use that to get a different Authorization token per service. This way, a token leaked from the HTTP session would only grant access to the one keep service that leaked instead of every keep service. (this is probably not a huge vunerability if everything is running over HTTPS, but I thought I would mention it.)
- What is the purpose of writing partial results while DataManager is running instead of just waiting until the end to write final results? Monitoring? Tools which read back the reports (such as to produce usage graphs) will need to handle situations where multiple log entries represent different points in time of a single run, and correctly identify the final results. Maybe tag the log records as "partial" or "final"?
Updated by Misha Zatsman almost 10 years ago
Thanks Peter! I've responded to all your comments. I'm gonna work on you request for better keep stats now, but I just wanted to get the rest of the changes to you in the meantime. I've pushed the latest version of my code.
Peter Amstutz wrote:
I want to start by saying this is really great, we have been looking forward to having DataManager for a long time, so don't take my nitpicky comments as a criticism of all your hard work!
Thanks for doing the review! I love nitpicky comments!
- In main()
[...]
You can make compiler happy by just taking out the unreferenced variables:
[...]
Thanks! The compiler was only part of the motivation. I wanted to keep these around because I'll need them once I write the rest of the datamanager, so I updated the comment.
- For consistency with other SDKs, consider changing the term "ManifestLine" to "ManifestStream"
Done
LineIter()/BlockIterWithDuplicates()
are using unbuffered channels, so there isn't much opportunity for concurrency.
Yeah, I wasn't looking to maximize concurrency. Inspired by Brett's library, I wanted to avoid unneccesary work of parsing data past where we want to read. In the particular case of datamanager, which is doing so much network i/o, I don't expect the concurrency to have a noticable effect on runtime.
- In
GetCollections()
, you could create a goroutine that issues the API query, and sends results over a channel to a 2nd goroutine that processes the result. This would allow for concurrent download and processing of collections.
I need to read the latest modification date from the current batch to form the next request. It's true that I could read that first, and do the rest of the processing concurrently, but I don't think the runtime improvement will be worth the burden of increased complexity on implementation and maintenance time. Again, I think the network i/o time will dominate everything else.
NumberCollectionsAvailable()
is not specific to collections (except for the first parameter toclient.List()
) and could be moved into the Go SDK as a utility function e.g.NumberItemsAvailable(client, resource)
Great idea, I moved it to go/sdk/util
- Instead of using "accessible",
GetKeepServers
should use client.List("keep_services") and filter onservice_type = disk
Changed, thanks
- It looks like
GetKeepServers
will fail if there are more than the default limit of keep_services? I see there is a TODO there.
Yes, that's right. Currently we request a limit of 1000 keep servers. If there are servers available that we don't receive, then we fail because we need complete information about the servers' contents.
- It was able to compile datamanager on the first try with no errors. Awesome! When I ran it, it sat there for a couple of seconds, then errored out:
[...]
This actually starts the goroutine to access the API server, then fails a few moments later because it is missing the token required to talk to the Keep servers. It should probably check flags before doing anything else.
Awesome!
I agree the user experience would be better if it failed faster, but that flag is hidden as an implementation detail in a private package variable inside keep. There are lots of failure cases with the token file: missing flag, flag pointing to missing file, flag pointing to file with invalid token. I think trying to deal with these cases would just make the code more confusing and complicated, when we'll hear about them a few seconds later anyways.
- Consider identifying keep services by uuid instead of host:port
What's the advantage of that? Where are you recommending I make this change?
- "dupicates_seen" misspelled should be "duplicates_seen"
Doh! thanks. I am the worst typr.
- the keep servers report should include number of blocks on the service, disk space in use, total disk space.
- Keep servers report should have a per-volume (i.e. per disk) breakdown
- This could be as simple as just adding the contents of /status.json reported by keepstore to the keep services report
Ok, cool, that will take a little more work since I need to fire off another http request per server. I'll do that, but I wanted to respond to the rest of your comments in the meantime.
parseBlockInfoFromIndexLine()
:
[...]strconv.Atoi()
returns an int which is can be either 32 or 64 bits. To guarantee there is no y2038 problem, it should use strconv.ParseInt() which returns int64.
Done, thanks.
- Unlike the shared secret signing token, the data manager token gets sent around and may have opportunities to leak. It would be more secure to take the shared secret and hash it together with each keep service's UUID, and use that to get a different Authorization token per service. This way, a token leaked from the HTTP session would only grant access to the one keep service that leaked instead of every keep service. (this is probably not a huge vunerability if everything is running over HTTPS, but I thought I would mention it.)
Yeah, might be worth discussing with Ward and/or Tom, but probably beyond the scope of this review.
- What is the purpose of writing partial results while DataManager is running instead of just waiting until the end to write final results? Monitoring? Tools which read back the reports (such as to produce usage graphs) will need to handle situations where multiple log entries represent different points in time of a single run, and correctly identify the final results. Maybe tag the log records as "partial" or "final"?
Yeah, it's for monitoring. Also useful for debugging if we crash in the middle.
Log records which have an end time set are the final ones.
Are you suggesting using different event_types for partial and final log records? Or are you suggesting using another field? Either would be easy to do, it's just a matter of which is easier for users. One argument for using another field is that if you want to get both types in one query then you still can. Which fields can we filter on?
Updated by Peter Amstutz almost 10 years ago
In logger Update(), ForceUpdate() and several other places, prefer the following pattern (note the l.lock.Unlock() moved from the end of the function to the defer statement):
func (l *Logger) Update(mutator LogMutator) { l.lock.Lock() defer l.lock.Unlock(); mutator(l.properties, l.entry) l.modified = true // We assume the mutator modified the log, even though we don't know for sure. l.considerWriting() }
defer
automates cleanup on function exit and reduces the chances that you'll exit the function with the lock still locked.
However, it would be more idiomatic Go to use a channel and goroutines instead of locks: producers send change requests into the channel, the log writer sits in a goroutine that reads from the channel and updates the log structure. A 2nd goroutine could sit in a loop, sleep() and then send a "heartbeat" on a 2nd channel to the log writer routine to flush pending commits.
New Go packages with tests should be added to arvados-dev/jenkins/run-tests.sh (https://arvados.org/projects/arvados/repository/arvados-dev/revisions/master/show/jenkins)
Consider identifying keep services by uuid instead of host:port
What's the advantage of that? Where are you recommending I make this change?
For continuity over time. Host and/or port might change due to network topology, but the UUID should continue refer to the same physical node.
Are you suggesting using different event_types for partial and final log records? Or are you suggesting using another field? Either would be easy to do, it's just a matter of which is easier for users. One argument for using another field is that if you want to get both types in one query then you still can. Which fields can we filter on?
I think setting different event types would be a good idea. I suggest event types "data-manager-start", "data-manager-error", "data-manager-partial-report", and "data-manager-final-report". Unfortunately, we can't query on nested fields (e.g. the "properties" field of the log record) until #4019 happens.
On a side note, if you use Emacs, you can load go-mode.el and add (add-hook 'before-save-hook 'gofmt-before-save)
to automatically gofmt your code whenever you save.
Updated by Ward Vandewege almost 10 years ago
- Target version changed from 2015-01-28 Sprint to 2015-02-18 sprint
Updated by Misha Zatsman almost 10 years ago
- Target version changed from 2015-02-18 sprint to 2015-01-28 Sprint
Peter Amstutz wrote:
- the keep servers report should include number of blocks on the service, disk space in use, total disk space.
- Keep servers report should have a per-volume (i.e. per disk) breakdown
- This could be as simple as just adding the contents of /status.json reported by keepstore to the keep services report
I added the keep servers status.json and pushed the code.
Still need to add number of blocks and disk space stats for whole keep server.
Will report stats from both status and index file to see whether they agree.
Updated by Ward Vandewege almost 10 years ago
- Target version changed from 2015-01-28 Sprint to 2015-02-18 sprint
Updated by Peter Amstutz almost 10 years ago
- The Arvados convention for timestamps is to end in "_at" so runInfo["started_at"] instead of runInfo["time_start"] (and "finished_at" instead of "time_finished")
- In logger.go NewLogger(), you should be able to initialize more of the fields in the struct constructor:
l := &Logger{ data: make(map[string]interface{}), params: params, entry: make(map[string]interface{}), properties: make(map[string]interface{}), workToDo: make(chan LogMutator, 10), writeTicker: time.NewTicker(params.WriteInterval)} l.data["log"] = l.entry l.entry["properties"] = l.properties
- In logger.go work() you can test if the channel is closed using the 2nd return value "valid" to stop the goroutine:
func (l *Logger) work() { for { select { case <-l.writeTicker.C: if l.modified { l.write(false) l.modified = false } case mutator, valid := <-l.workToDo: if !valid { return } mutator(l.properties, l.entry) l.modified = true } } }
- Having write() block the logger work loop is not that big a deal as long as the mutator queue is long enough to soak up any pending changes and not block the data producer thread
- As I suggested previously, Keep services should be reported based on their service table uuid, with the host/port as secondary fields. Host and port may change based on cluster configuration but the UUID is stable, which is necessary to create reports that graph usage over time.
- The report should include the total size of data managed under Keep:
- total size of all blocks stored across all keep servers
- deduplicated total (total size of all distinct blocks)
- The per-user/per-group data size reporting should break down this way as well
- It doesn't correlate the set of blocks listed in the collections owned by a user with the set of blocks actually stored in Keep as reported by keepstore. In my setup, I have collections records copied from the public beta server which total to 7 TiB, but I don't actually have 7 TiB of data in my development Keep instance. This is important because in the future we intend to have federated storage, where blocks may be stored offsite -- blocks which are not actually stored on the cluster shouldn't be counted against the user's disk usage.
- If datamanager starts and immediately fails (such as a missing "-data-manager-token-file" command) it still writes a "data-manager-final" report which is a little confusing (this is why I suggested a distinct "error" event type).
- Why is "Uuid to Size used" logged to the console in GetCollectionsAndSummarize()? It is already written out to the final report and doesn't seem useful in a "tell me what the service is doing" kind of way (maybe useful for debugging?)
- Please merge with master
I'd do want to move this along towards merging so let me know if you feel like any of these comments are out of scope / too much work to hold up the merge.
Updated by Misha Zatsman almost 10 years ago
Peter Amstutz wrote:
- The Arvados convention for timestamps is to end in "_at" so runInfo["started_at"] instead of runInfo["time_start"] (and "finished_at" instead of "time_finished")
ok, changed. Where are these conventions recorded so next time I can check them myself?
- In logger.go NewLogger(), you should be able to initialize more of the fields in the struct constructor:
[...]
Done, thanks!
- In logger.go work() you can test if the channel is closed using the 2nd return value "valid" to stop the goroutine:
[...]
Writing to a closed channel will cause a panic. How do I prevent other goroutines from calling logger.Update() after I've closed the channel? Reintroduce the a lock that we replaced with channels?
- Having write() block the logger work loop is not that big a deal as long as the mutator queue is long enough to soak up any pending changes and not block the data producer thread
ok
- As I suggested previously, Keep services should be reported based on their service table uuid, with the host/port as secondary fields. Host and port may change based on cluster configuration but the UUID is stable, which is necessary to create reports that graph usage over time.
Changed, thanks for the reminder.
- The report should include the total size of data managed under Keep:
- total size of all blocks stored across all keep servers
- deduplicated total (total size of all distinct blocks)
- The per-user/per-group data size reporting should break down this way as well
- It doesn't correlate the set of blocks listed in the collections owned by a user with the set of blocks actually stored in Keep as reported by keepstore. In my setup, I have collections records copied from the public beta server which total to 7 TiB, but I don't actually have 7 TiB of data in my development Keep instance. This is important because in the future we intend to have federated storage, where blocks may be stored offsite -- blocks which are not actually stored on the cluster shouldn't be counted against the user's disk usage.
These are great feature requests. Feel free to add them to https://arvados.org/projects/arvados/wiki/Data_Manager_Design_Doc#Additional-Reports
- If datamanager starts and immediately fails (such as a missing "-data-manager-token-file" command) it still writes a "data-manager-final" report which is a little confusing (this is why I suggested a distinct "error" event type).
Yup, I proposed this over email and explicitly called out the fact that the final event type is used regardless of an error or a regular completion. You wrote back and said it sounds good.
- Why is "Uuid to Size used" logged to the console in GetCollectionsAndSummarize()? It is already written out to the final report and doesn't seem useful in a "tell me what the service is doing" kind of way (maybe useful for debugging?)
Useful for debugging, just to see that things are happening.
- Please merge with master
Done, after much git wrangling with Tom.
I'd do want to move this along towards merging so let me know if you feel like any of these comments are out of scope / too much work to hold up the merge.
Done, thanks!
Updated by Ward Vandewege almost 10 years ago
- Target version changed from 2015-02-18 sprint to 2015-03-11 sprint
Updated by Ward Vandewege almost 10 years ago
- Story points changed from 5.0 to 3.0
Updated by Ward Vandewege almost 10 years ago
- Target version changed from 2015-03-11 sprint to 2015-04-01 sprint
Updated by Ward Vandewege almost 10 years ago
- Target version changed from 2015-04-01 sprint to 2015-04-29 sprint
Updated by Ward Vandewege over 9 years ago
- Target version changed from 2015-04-29 sprint to 2015-05-20 sprint
Updated by Brett Smith over 9 years ago
- Target version changed from 2015-05-20 sprint to 2015-06-10 sprint
Updated by Peter Amstutz over 9 years ago
Reviewing 5824ee2
- "DefaultReplicationLevel" should come from the "defaultCollectionReplication" key in the discovery document (GET /discovery/v1/apis/arvados/v1/rest from the API server, we should probably add a SDK method to arvadosclient for convenience) instead of being a constant.
- Instead of "MaybeReadData" and "MaybeWriteData" in singlerun(), perhaps have a "var GetKeepData" which is a pointer to a function which is set to implement the appropriate behavior for testing/development/production?
- Data manager should be aware of the new read only flag on the keep_services table (implemented last sprint in #4717) and take that into account when building pull lists in ComputePullServers. (see kc.WritableLocalRoots())
- Also in ComputePullServers,
if numCopies == 0
that should be logged since that's potentially a serious problem (since that means data is missing.)
- If a block can't be replicated sufficiently (not enough writable servers) that should also be logged.
- Please put "WritePullLists" behind a flag, so it is optionally enabled (obviously it's only intended for development/testing, but it's possible it might be useful for production logging.)
That's my high level comments so far, I'll continue looking at it tomorrow
Updated by Peter Amstutz over 9 years ago
More comments.
In BuildPullLists
, I think the assignment spl[destination] = pullList
inside the if !pullListExists
is redundant.
pullList, pullListExists := spl[destination] if !pullListExists { pullList = PullList{} spl[destination] = pullList } pullList = append(pullList, PullRequest{Locator: locator, Servers: pullServers.From}) spl[destination] = pullList
Looking at the keepstore code, I think it is expecting the url scheme (http://
or https://
) in the server strings of the pull lists, however you're stripping that off (which is what the spec says). Either the spec/datamanager or keepstore will need to be fixed. More generally, we need to figure out how to set up some kind of integration test between them.
For the longer term design, if you're having heap size issues storing all the blocks, would it make sense to look at using an embedded database such as sqlite or https://github.com/boltdb/bolt ?
Updated by Peter Amstutz over 9 years ago
I ran it without any arguments and here's what I got:
$ ./datamanager 2015/05/22 11:11:21 50 collections read, 50 new in last batch, 2013-07-10T08:01:20Z latest modified date, 322 557 16113 avg,max,total manifest size 2015/05/22 11:11:21 Received keep services list: {ItemsAvailable:2 KeepServers:[petere1:25107 petere1:25108]} 2015/05/22 11:11:21 Got Server Addresses: {false [petere1:25107 petere1:25108] map[petere1:25107:0 petere1:25108:1] map[] map[] map[]} 2015/05/22 11:11:21 About to fetch keep server contents from http://petere1:25107/index Usage of ./datamanager: -data-manager-token-file="": File with the API token we should use to contact keep servers. -heap-profile="": File to write the heap profiles to. Leave blank to skip profiling. -log-event-type-prefix="experimental-data-manager": Prefix to use in the event_type of our arvados log entries. Set to empty to turn off logging -log-frequency-seconds=20: How frequently we'll write log entries in seconds. -minutes-between-runs=0: How many minutes we wait betwen data manager runs. 0 means run once and exit. -read-data-from="": Avoid network i/o and read summary data from this file instead. Used for development only. -write-data-to="": Write summary of data received to this file. Used for development only. 2015/05/22 11:11:21 About to fetch keep server contents from http://petere1:25108/index 2015/05/22 11:11:21 Data Manager Token needed, but data manager token file not specified.
If command line arguments are required, it should fail up front instead of starting to do stuff and then failing later on.
Updated by Peter Amstutz over 9 years ago
Also, API server provides microsecond timestamps, but it looks like datamanager is truncating that to whole seconds. Using the full precision should make it easier to avoid fetching collections redundantly as there should be fewer cases of collections having exactly the same timestamp.
Updated by Peter Amstutz over 9 years ago
Okay, I was able to run data manager against my local development instance and get a report.
Scratch one of my prior comments, I see now that MaybeReadData
and MaybeWriteData
are already optional on the command line behind "-read-data-from" and "-write-data-to".
Updated by Misha Zatsman over 9 years ago
Peter Amstutz wrote:
Reviewing 5824ee2
Thanks!
- "DefaultReplicationLevel" should come from the "defaultCollectionReplication" key in the discovery document (GET /discovery/v1/apis/arvados/v1/rest from the API server, we should probably add a SDK method to arvadosclient for convenience) instead of being a constant.
Sounds great, I added it as a todo for when the SDK method is added.
- Instead of "MaybeReadData" and "MaybeWriteData" in singlerun(), perhaps have a "var GetKeepData" which is a pointer to a function which is set to implement the appropriate behavior for testing/development/production?
Good call, lemme know what you think of my implementation. I had to use a builder, so it's a little clumsier than I thought it would be at first. datamanager.go
and file.go
are the two files affected by this change.
- Data manager should be aware of the new read only flag on the keep_services table (implemented last sprint in #4717) and take that into account when building pull lists in ComputePullServers. (see kc.WritableLocalRoots())
Thanks, I'm using WriteableLocalRoots now. This required changing the logic (and signature) of CreatePullServers
somewhat.
- Also in ComputePullServers,
if numCopies == 0
that should be logged since that's potentially a serious problem (since that means data is missing.)
Missing data is actually detected upstream in SummarizeBuckets in summary.go
0 numCopies
here would indicate a logic error, since we're iterating over blocks which exist in keep, but at replication levels lower than the collections require.
FYI on qr1hi, as of mid march, there were 96978 blocks which are specified in cllections on the api server but do not appear in keep. So this is a large amount of information to log. If we just log the collections themselves, it becomes much smaller, but it's still 689 collections
- If a block can't be replicated sufficiently (not enough writable servers) that should also be logged.
Where do you want this logged? In the log entry that we write to the API server? Or to the console?
- Please put "WritePullLists" behind a flag, so it is optionally enabled (obviously it's only intended for development/testing, but it's possible it might be useful for production logging.)
That is how I wrote it.
In
BuildPullLists
, I think the assignmentspl[destination] = pullList
inside theif !pullListExists
is redundant.
Fixed, thanks!
Looking at the keepstore code, I think it is expecting the url scheme (
http://
orhttps://
) in the server strings of the pull lists, however you're stripping that off (which is what the spec says). Either the spec/datamanager or keepstore will need to be fixed. More generally, we need to figure out how to set up some kind of integration test between them.
Cool, I'll leave it with the spec version, thanks for noticing that. Integration test sounds like a good longterm goal, but I hope in the future we'll make it easier to isolate integration tests from unit tests, so that developers can run just the unit tests if they want.
For the longer term design, if you're having heap size issues storing all the blocks, would it make sense to look at using an embedded database such as sqlite or https://github.com/boltdb/bolt ?
I thought I was having heap size issues, but it turns out the problem was actually on the api server.
If command line arguments are required, it should fail up front instead of starting to do stuff and then failing later on.
I just ran it and it failed after 135 ms. That seems fast enough to me.
Flags are defined in the files that use them, there is no global view of which flags are needed because it is a complex question. For example, If you are reading summary data from a file, you don't need a datamanager token.
I like the ease of maintenance of the current setup and I think it outweighs the cost of having a few extra lines for the reader to read. We'll never be able to pre-check all conditions (token file specified, token file exists, token file readable by user, etc...) so we have to accept that sometimes we have to discover these failures while running . Trying to explictly list what flag combinations we need runs the risk of having those rules grow outdated vs our actual needs.
Also, API server provides microsecond timestamps, but it looks like datamanager is truncating that to whole seconds. Using the full precision should make it easier to avoid fetching collections redundantly as there should be fewer cases of collections having exactly the same timestamp.
Where did it look like we were truncating to whole seconds? I think what your seeing is that the data is stored that way in the api server.
#6076 Sets out to fix the problem on the api server.
Updated by Peter Amstutz over 9 years ago
Misha Zatsman wrote:
Sounds great, I added it as a todo for when the SDK method is added.
Can you review the branch 6235-go-sdk-discovery
? This adds the relevant SDK method.
Good call, lemme know what you think of my implementation. I had to use a builder, so it's a little clumsier than I thought it would be at first.
datamanager.go
andfile.go
are the two files affected by this change.
Looks good, thanks.
- Also in ComputePullServers,
if numCopies == 0
that should be logged since that's potentially a serious problem (since that means data is missing.)Missing data is actually detected upstream in SummarizeBuckets in summary.go
I see now. SummarizeBuckets categorizes the blocks into various BlockSets.
I'm confused by readCollections.BlockToReplication
, is this the actual replication level, or the desired replication level? Consider tweaking the variable name to blockToDesiredReplication. Also maybe add a comment that the under replicated block list does not include blocks with 0 replicas.
FYI on qr1hi, as of mid march, there were 96978 blocks which are specified in cllections on the api server but do not appear in keep. So this is a large amount of information to log. If we just log the collections themselves, it becomes much smaller, but it's still 689 collections
- If a block can't be replicated sufficiently (not enough writable servers) that should also be logged.
Where do you want this logged? In the log entry that we write to the API server? Or to the console?
So, we need to be able to get at above information. Spamming the console log may not be the best way, but it does need to be provided because it should be a red flag for an admin that something is amiss.
Looking at the keepstore code, I think it is expecting the url scheme (
http://
orhttps://
) in the server strings of the pull lists, however you're stripping that off (which is what the spec says). Either the spec/datamanager or keepstore will need to be fixed. More generally, we need to figure out how to set up some kind of integration test between them.Cool, I'll leave it with the spec version, thanks for noticing that. Integration test sounds like a good longterm goal, but I hope in the future we'll make it easier to isolate integration tests from unit tests, so that developers can run just the unit tests if they want.
Tom and I agreed that it should include "http://" or "https://". The spec has been updated, so please update DataManager.
Updated by Misha Zatsman over 9 years ago
Peter Amstutz wrote:
Misha Zatsman wrote:
Sounds great, I added it as a todo for when the SDK method is added.
Can you review the branch
6235-go-sdk-discovery
? This adds the relevant SDK method.
Reviewed, I just had one question.
Updated by Peter Amstutz over 9 years ago
Misha Zatsman wrote:
Peter Amstutz wrote:
Misha Zatsman wrote:
Sounds great, I added it as a todo for when the SDK method is added.
Can you review the branch
6235-go-sdk-discovery
? This adds the relevant SDK method.Reviewed, I just had one question.
Responded on #6235
Also reviewed 3408-add-size-to-locators. Go ahead and merge that with 3408-production-datamanager. Let me know when you've responded to my other review comments in note 37. Finally, I'll talk to Brett about adding story for next sprint to set up some kind of integration test between data manager, the API server and the keep servers.
Updated by Brett Smith over 9 years ago
- Target version changed from 2015-06-10 sprint to 2015-07-08 sprint
Updated by Misha Zatsman over 9 years ago
Peter Amstutz wrote:
Can you review the branch
6235-go-sdk-discovery
? This adds the relevant SDK method.
I was surprised to find that this value is type float64, I added a comment to #6235.
I'm confused by
readCollections.BlockToReplication
, is this the actual replication level, or the desired replication level? Consider tweaking the variable name to blockToDesiredReplication. Also maybe add a comment that the under replicated block list does not include blocks with 0 replicas.
Renamed and added a note.
FYI on qr1hi, as of mid march, there were 96978 blocks which are specified in cllections on the api server but do not appear in keep. So this is a large amount of information to log. If we just log the collections themselves, it becomes much smaller, but it's still 689 collections
- If a block can't be replicated sufficiently (not enough writable servers) that should also be logged.
Where do you want this logged? In the log entry that we write to the API server? Or to the console?
So, we need to be able to get at above information. Spamming the console log may not be the best way, but it does need to be provided because it should be a red flag for an admin that something is amiss.
Totaly agree it's a red flag. The counts appear in the log entry under summary_info, so we can monitor those values for any kind of alerting we want to do.
Looking at the keepstore code, I think it is expecting the url scheme (
http://
orhttps://
) in the server strings of the pull lists, however you're stripping that off (which is what the spec says). Either the spec/datamanager or keepstore will need to be fixed. More generally, we need to figure out how to set up some kind of integration test between them.Cool, I'll leave it with the spec version, thanks for noticing that. Integration test sounds like a good longterm goal, but I hope in the future we'll make it easier to isolate integration tests from unit tests, so that developers can run just the unit tests if they want.
Tom and I agreed that it should include "http://" or "https://". The spec has been updated, so please update DataManager.
Done. This wasted effort of rewriting could have been avoided if whoever decided to change the implementation had also updated the spec. Preventing waste is exactly why I took the time to write a spec in the first place.
Also reviewed 3408-add-size-to-locators. Go ahead and merge that with 3408-production-datamanager.
Merged.
I just ran the datamanger a bunch of times and it seems to be reliably working. It finishes super fast too (under ten minutes), which is great after a year of watching it take forever.
Let me know what you think, I believe this is ready to merge.
Updated by Peter Amstutz over 9 years ago
Misha Zatsman wrote:
Peter Amstutz wrote:
Can you review the branch
6235-go-sdk-discovery
? This adds the relevant SDK method.I was surprised to find that this value is type float64, I added a comment to #6235.
Responded on #6235
I'm confused by
readCollections.BlockToReplication
, is this the actual replication level, or the desired replication level? Consider tweaking the variable name to blockToDesiredReplication. Also maybe add a comment that the under replicated block list does not include blocks with 0 replicas.Renamed and added a note.
Thanks.
Totaly agree it's a red flag. The counts appear in the log entry under summary_info, so we can monitor those values for any kind of alerting we want to do.
Ok. The team has been discussing the need for a real monitoring/alerting infrastructure so this can go on the list of things to monitor.
Done. This wasted effort of rewriting could have been avoided if whoever decided to change the implementation had also updated the spec. Preventing waste is exactly why I took the time to write a spec in the first place.
Well, I agree that ideally the question of "how does the keep service know which protocol to use to contact another keep service" should have come up when the spec was reviewed, but that was an oversight.
I just ran the datamanger a bunch of times and it seems to be reliably working. It finishes super fast too (under ten minutes), which is great after a year of watching it take forever.
That's very good to hear, several significant performance fixes for handling collections the API server were merged in the last sprint so it is very good news that it translates into better performance for datamanager.
Let me know what you think, I believe this is ready to merge.
Er, tell me if I'm just overlooking it, but this appears to not yet have the actual code to POST the pull list to each keep service?
Possibly we could do that in #6260 (data manager + keepstore integration test) since the purpose of the integration test is to validate that the data manager and keepstore can actually work together to execute the pull list behavior successfully.
I will talk to Brett tomorrow to decide how we want to proceed.
Updated by Brett Smith over 9 years ago
- Assigned To changed from Misha Zatsman to Peter Amstutz
Peter, please sync up with me about the current state of this branch, and let's figure out next steps for it.
Updated by Brett Smith over 9 years ago
- Target version changed from 2015-07-08 sprint to 2015-07-22 sprint
Updated by Brett Smith over 9 years ago
- Status changed from In Progress to Rejected
To spell out a little better what should happen with the issue this sprint:
- Does the outstanding Data Manager branch have code to generate trash lists? If not, let's assess what it would take to add that, and consider doing it if it can be done in an ideal development day.
- If it gets to that point this sprint, Peter should take point on actually merging the branch to master.
Updated by Brett Smith over 9 years ago
- Status changed from Rejected to In Progress
Updated by Peter Amstutz over 9 years ago
- Assigned To changed from Peter Amstutz to Paolo Di Tommaso
- Status changed from In Progress to Resolved
Marking as resolved because the last push from Misha is merged and further development will occur through new stories instead of continuing this epic story.
Updated by Peter Amstutz over 9 years ago
- Assigned To changed from Paolo Di Tommaso to Peter Amstutz