Idea #3408

[Keep] Implement Production Data Manager

Added by Misha Zatsman over 9 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Keep
Target version:
Start date: 07/29/2014
Due date:
Story points: 1.0

Description


Subtasks (3 total: 0 open, 3 closed)

Task #6613: Merge 3408 (Resolved, Peter Amstutz, 07/29/2014)
Task #6108: Review data manager (Resolved, Misha Zatsman, 07/29/2014)
Task #4941: Review 3408-go-sdk-api-errors (Resolved, Misha Zatsman, 07/29/2014)

Related issues

Related to Arvados - Bug #6076: [API] walk api server installations and ensure modified_at for collections is unique + ensure modified_at is enforced to be unique at the api level (New)
Related to Arvados - Idea #6260: [Keep] Integration test between data manager, API and keepstore (Resolved, Radhika Chippada, 08/19/2015)
Blocked by Arvados - Idea #3409: Use collection owner field (Rejected, 07/29/2014)
Blocked by Arvados - Idea #3410: [API] Rename collection "desired replication" attributes, add model constraints (Resolved, Tom Clegg, 07/29/2014)
Blocked by Arvados - Idea #3411: [API] Implement Trash behavior using collection expiration (Resolved, Tim Pierce, 07/29/2014)
Blocked by Arvados - Idea #3412: [API] Full collection view in list returned by api server (Resolved, Brett Smith, 08/08/2014)
Blocked by Arvados - Idea #3413: [Keep] Keep Server accepts list of trashable blocks (Resolved, Tim Pierce, 07/29/2014)
Blocked by Arvados - Idea #3414: [Keep] Keep Server accepts list of blocks to pull from other servers (Resolved, Tim Pierce, 08/21/2014)
Blocked by Arvados - Bug #5834: [API] Be careful about memory usage when responding to collections#index queries (Resolved, Brett Smith, 04/30/2015)
Blocked by Arvados - Bug #6074: [API] collections are being returned too slowly + NoMemoryError (Resolved, Tom Clegg, 06/03/2015)
Actions #1

Updated by Ward Vandewege over 9 years ago

  • Target version set to Arvados Future Sprints
Actions #2

Updated by Ward Vandewege over 9 years ago

  • Target version changed from Arvados Future Sprints to 2014-08-27 Sprint
Actions #3

Updated by Ward Vandewege over 9 years ago

  • Story points set to 8.0
Actions #4

Updated by Ward Vandewege over 9 years ago

  • Story points deleted (8.0)
Actions #5

Updated by Tom Clegg over 9 years ago

  • Subject changed from Implement Production Data Manager to [Keep] Implement Production Data Manager
Actions #6

Updated by Tim Pierce over 9 years ago

  • Category set to Keep
Actions #7

Updated by Ward Vandewege over 9 years ago

  • Target version changed from 2014-08-27 Sprint to Arvados Future Sprints
Actions #8

Updated by Ward Vandewege over 9 years ago

  • Story points set to 5.0
Actions #9

Updated by Ward Vandewege over 9 years ago

  • Target version changed from Arvados Future Sprints to 2014-09-17 sprint
Actions #10

Updated by Ward Vandewege over 9 years ago

  • Target version changed from 2014-09-17 sprint to 2014-10-08 sprint
Actions #11

Updated by Ward Vandewege over 9 years ago

  • Status changed from New to In Progress
Actions #12

Updated by Ward Vandewege over 9 years ago

  • Target version changed from 2014-10-08 sprint to 2014-10-29 sprint
Actions #13

Updated by Tom Clegg over 9 years ago

  • Target version changed from 2014-10-29 sprint to 2014-11-19 sprint
Actions #14

Updated by Ward Vandewege over 9 years ago

  • Target version changed from 2014-11-19 sprint to 2014-12-10 sprint
Actions #15

Updated by Ward Vandewege over 9 years ago

  • Target version changed from 2014-12-10 sprint to 2015-01-07 sprint
Actions #16

Updated by Tom Clegg over 9 years ago

  • Target version changed from 2015-01-07 sprint to 2015-01-28 Sprint
Actions #17

Updated by Peter Amstutz over 9 years ago

I want to start by saying this is really great, we have been looking forward to having DataManager for a long time, so don't take my nitpicky comments as a criticism of all your hard work!

  • In main()
        keepServerInfo := keep.GetKeepServersAndSummarize(
            keep.GetKeepServersParams{Client: arv, Logger: arvLogger, Limit: 1000})
    
        readCollections := <-collectionChannel
    
        // Make compiler happy.
        _ = readCollections
        _ = keepServerInfo
    

    You can make the compiler happy by just taking out the unreferenced variables:
        keep.GetKeepServersAndSummarize(
            keep.GetKeepServersParams{Client: arv, Logger: arvLogger, Limit: 1000})
    
        <-collectionChannel
    
  • For consistency with other SDKs, consider changing the term "ManifestLine" to "ManifestStream"
  • LineIter()/BlockIterWithDuplicates() are using unbuffered channels, so there isn't much opportunity for concurrency.
  • In GetCollections(), you could create a goroutine that issues the API query, and sends results over a channel to a 2nd goroutine that processes the result. This would allow for concurrent download and processing of collections.
  • NumberCollectionsAvailable() is not specific to collections (except for the first parameter to client.List()) and could be moved into the Go SDK as a utility function e.g. NumberItemsAvailable(client, resource)
  • Instead of using "accessible", GetKeepServers should use client.List("keep_services") and filter on service_type = disk
  • It looks like GetKeepServers will fail if there are more than the default limit of keep_services? I see there is a TODO there.
  • I was able to compile datamanager on the first try with no errors. Awesome! When I ran it, it sat there for a couple of seconds, then errored out:
    $ bin/datamanager
    2015/01/21 14:19:21 Received 2 of 2 available keep server addresses.
    2015/01/21 14:19:21 Received keep services list: {2 [petere1:25108 petere1:25107]}
    2015/01/21 14:19:21 Got Server Addresses: {true [petere1:25108 petere1:25107] map[petere1:25107:1 petere1:25108:0] map[] map[] map[]}
    2015/01/21 14:19:21 About to fetch keep server contents from http://petere1:25108/index
    Usage of bin/datamanager:
      -data-manager-token-file="": File with the API token we should use to contact keep servers.
      -heap-profile="": File to write the heap profiles to. Leave blank to skip profiling.
      -log-event-type="experimental-data-manager-report": event_type to use in our arvados log entries. Set to empty to turn off logging
      -log-frequency-seconds=20: How frequently we'll write log entries in seconds.
    2015/01/21 14:19:21 About to fetch keep server contents from http://petere1:25107/index
    2015/01/21 14:19:21 Data Manager Token needed, but data manager token file not specified.
    

    This actually starts the goroutine to access the API server, then fails a few moments later because it is missing the token required to talk to the Keep servers. It should probably check flags before doing anything else.
  • Consider identifying keep services by uuid instead of host:port
  • "dupicates_seen" misspelled should be "duplicates_seen"
  • the keep servers report should include number of blocks on the service, disk space in use, total disk space.
    • Keep servers report should have a per-volume (i.e. per disk) breakdown
    • This could be as simple as just adding the contents of /status.json reported by keepstore to the keep services report
  • parseBlockInfoFromIndexLine():
    blockInfo.Mtime, err = strconv.Atoi(tokens[1])
    

    strconv.Atoi() returns an int, which can be either 32 or 64 bits. To guarantee there is no y2038 problem, it should use strconv.ParseInt(), which returns int64 (see the one-line sketch after this list).
  • Unlike the shared secret signing token, the data manager token gets sent around and may have opportunities to leak. It would be more secure to take the shared secret and hash it together with each keep service's UUID, and use that to get a different Authorization token per service. This way, a token leaked from the HTTP session would only grant access to the one keep service that leaked instead of every keep service. (This is probably not a huge vulnerability if everything is running over HTTPS, but I thought I would mention it.)
  • What is the purpose of writing partial results while DataManager is running instead of just waiting until the end to write final results? Monitoring? Tools which read back the reports (such as to produce usage graphs) will need to handle situations where multiple log entries represent different points in time of a single run, and correctly identify the final results. Maybe tag the log records as "partial" or "final"?
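
A one-line sketch of the ParseInt fix suggested above (this assumes blockInfo.Mtime is changed to int64; names follow the snippet quoted earlier):

    // ParseInt returns int64 on every platform, so the mtime is not
    // truncated on 32-bit builds (avoiding the y2038 problem):
    blockInfo.Mtime, err = strconv.ParseInt(tokens[1], 10, 64)
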
Actions #18

Updated by Misha Zatsman about 9 years ago

Thanks Peter! I've responded to all your comments. I'm gonna work on your request for better keep stats now, but I just wanted to get the rest of the changes to you in the meantime. I've pushed the latest version of my code.

Peter Amstutz wrote:

I want to start by saying this is really great, we have been looking forward to having DataManager for a long time, so don't take my nitpicky comments as a criticism of all your hard work!

Thanks for doing the review! I love nitpicky comments!

  • In main()
    [...]
    You can make compiler happy by just taking out the unreferenced variables:
    [...]

Thanks! The compiler was only part of the motivation. I wanted to keep these around because I'll need them once I write the rest of the datamanager, so I updated the comment.

  • For consistency with other SDKs, consider changing the term "ManifestLine" to "ManifestStream"

Done

  • LineIter()/BlockIterWithDuplicates() are using unbuffered channels, so there isn't much opportunity for concurrency.

Yeah, I wasn't looking to maximize concurrency. Inspired by Brett's library, I wanted to avoid the unnecessary work of parsing data past where we want to read. In the particular case of datamanager, which is doing so much network i/o, I don't expect the concurrency to have a noticeable effect on runtime.

  • In GetCollections(), you could create a goroutine that issues the API query, and sends results over a channel to a 2nd goroutine that processes the result. This would allow for concurrent download and processing of collections.

I need to read the latest modification date from the current batch to form the next request. It's true that I could read that first, and do the rest of the processing concurrently, but I don't think the runtime improvement would be worth the added implementation and maintenance complexity. Again, I think the network i/o time will dominate everything else.

  • NumberCollectionsAvailable() is not specific to collections (except for the first parameter to client.List()) and could be moved into the Go SDK as a utility function e.g. NumberItemsAvailable(client, resource)

Great idea, I moved it to go/sdk/util

  • Instead of using "accessible", GetKeepServers should use client.List("keep_services") and filter on service_type = disk

Changed, thanks

  • It looks like GetKeepServers will fail if there are more than the default limit of keep_services? I see there is a TODO there.

Yes, that's right. Currently we request a limit of 1000 keep servers. If there are servers available that we don't receive, then we fail because we need complete information about the servers' contents.

  • It was able to compile datamanager on the first try with no errors. Awesome! When I ran it, it sat there for a couple of seconds, then errored out:
    [...]
    This actually starts the goroutine to access the API server, then fails a few moments later because it is missing the token required to talk to the Keep servers. It should probably check flags before doing anything else.

Awesome!

I agree the user experience would be better if it failed faster, but that flag is hidden as an implementation detail in a private package variable inside keep. There are lots of failure cases with the token file: missing flag, flag pointing to missing file, flag pointing to file with invalid token. I think trying to deal with these cases would just make the code more confusing and complicated, when we'll hear about them a few seconds later anyways.

  • Consider identifying keep services by uuid instead of host:port

What's the advantage of that? Where are you recommending I make this change?

  • "dupicates_seen" misspelled should be "duplicates_seen"

Doh! thanks. I am the worst typr.

  • the keep servers report should include number of blocks on the service, disk space in use, total disk space.
    • Keep servers report should have a per-volume (i.e. per disk) breakdown
    • This could be as simple as just adding the contents of /status.json reported by keepstore to the keep services report

Ok, cool, that will take a little more work since I need to fire off another http request per server. I'll do that, but I wanted to respond to the rest of your comments in the meantime.

  • parseBlockInfoFromIndexLine():
    [...]
    strconv.Atoi() returns an int, which can be either 32 or 64 bits. To guarantee there is no y2038 problem, it should use strconv.ParseInt(), which returns int64.

Done, thanks.

  • Unlike the shared secret signing token, the data manager token gets sent around and may have opportunities to leak. It would be more secure to take the shared secret and hash it together with each keep service's UUID, and use that to get a different Authorization token per service. This way, a token leaked from the HTTP session would only grant access to the one keep service that leaked instead of every keep service. (This is probably not a huge vulnerability if everything is running over HTTPS, but I thought I would mention it.)

Yeah, might be worth discussing with Ward and/or Tom, but probably beyond the scope of this review.

  • What is the purpose of writing partial results while DataManager is running instead of just waiting until the end to write final results? Monitoring? Tools which read back the reports (such as to produce usage graphs) will need to handle situations where multiple log entries represent different points in time of a single run, and correctly identify the final results. Maybe tag the log records as "partial" or "final"?

Yeah, it's for monitoring. Also useful for debugging if we crash in the middle.

Log records which have an end time set are the final ones.

Are you suggesting using different event_types for partial and final log records? Or are you suggesting using another field? Either would be easy to do, it's just a matter of which is easier for users. One argument for using another field is that if you want to get both types in one query then you still can. Which fields can we filter on?

Actions #19

Updated by Peter Amstutz about 9 years ago

In logger Update(), ForceUpdate() and several other places, prefer the following pattern (note the l.lock.Unlock() moved from the end of the function to the defer statement):

func (l *Logger) Update(mutator LogMutator) {
    l.lock.Lock()
    defer l.lock.Unlock()

    mutator(l.properties, l.entry)
    l.modified = true // We assume the mutator modified the log, even though we don't know for sure.

    l.considerWriting()
}

defer automates cleanup on function exit and reduces the chances that you'll exit the function with the lock still locked.

However, it would be more idiomatic Go to use a channel and goroutines instead of locks: producers send change requests into the channel, the log writer sits in a goroutine that reads from the channel and updates the log structure. A 2nd goroutine could sit in a loop, sleep() and then send a "heartbeat" on a 2nd channel to the log writer routine to flush pending commits.
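
As a rough illustration of that design (hypothetical shape, not the actual logger API), the producer side then no longer needs a lock at all; it just hands its mutator to the writer goroutine:

    // Sketch only: workToDo is the channel read by the single writer goroutine.
    func (l *Logger) Update(mutator LogMutator) {
        l.workToDo <- mutator
    }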

New Go packages with tests should be added to arvados-dev/jenkins/run-tests.sh (https://arvados.org/projects/arvados/repository/arvados-dev/revisions/master/show/jenkins)

Consider identifying keep services by uuid instead of host:port

What's the advantage of that? Where are you recommending I make this change?

For continuity over time. Host and/or port might change due to network topology, but the UUID should continue to refer to the same physical node.

Are you suggesting using different event_types for partial and final log records? Or are you suggesting using another field? Either would be easy to do, it's just a matter of which is easier for users. One argument for using another field is that if you want to get both types in one query then you still can. Which fields can we filter on?

I think setting different event types would be a good idea. I suggest event types "data-manager-start", "data-manager-error", "data-manager-partial-report", and "data-manager-final-report". Unfortunately, we can't query on nested fields (e.g. the "properties" field of the log record) until #4019 happens.

On a side note, if you use Emacs, you can load go-mode.el and add (add-hook 'before-save-hook 'gofmt-before-save) to automatically gofmt your code whenever you save.

Actions #20

Updated by Ward Vandewege about 9 years ago

  • Target version changed from 2015-01-28 Sprint to 2015-02-18 sprint
Actions #21

Updated by Misha Zatsman about 9 years ago

  • Target version changed from 2015-02-18 sprint to 2015-01-28 Sprint

Peter Amstutz wrote:

  • the keep servers report should include number of blocks on the service, disk space in use, total disk space.
    • Keep servers report should have a per-volume (i.e. per disk) breakdown
    • This could be as simple as just adding the contents of /status.json reported by keepstore to the keep services report

I added the keep servers status.json and pushed the code.

Still need to add number of blocks and disk space stats for whole keep server.

Will report stats from both status and index file to see whether they agree.

Actions #22

Updated by Ward Vandewege about 9 years ago

  • Target version changed from 2015-01-28 Sprint to 2015-02-18 sprint
Actions #23

Updated by Peter Amstutz about 9 years ago

  • The Arvados convention for timestamps is to end in "_at" so runInfo["started_at"] instead of runInfo["time_start"] (and "finished_at" instead of "time_finished")
  • In logger.go NewLogger(), you should be able to initialize more of the fields in the struct constructor:
        l := &Logger{
                    data: make(map[string]interface{}),
                    params: params,
                    entry: make(map[string]interface{}),
                    properties: make(map[string]interface{}),
                    workToDo: make(chan LogMutator, 10),
                    writeTicker: time.NewTicker(params.WriteInterval)}
        l.data["log"] = l.entry
        l.entry["properties"] = l.properties
    
  • In logger.go work() you can test if the channel is closed using the 2nd return value "valid" to stop the goroutine:
    func (l *Logger) work() {
        for {
            select {
            case <-l.writeTicker.C:
                if l.modified {
                    l.write(false)
                    l.modified = false
                }
            case mutator, valid := <-l.workToDo:
                if !valid {
                    return
                }
                mutator(l.properties, l.entry)
                l.modified = true
            }
        }
    }
    
  • Having write() block the logger work loop is not that big a deal as long as the mutator queue is long enough to soak up any pending changes and not block the data producer thread
  • As I suggested previously, Keep services should be reported based on their service table uuid, with the host/port as secondary fields. Host and port may change based on cluster configuration but the UUID is stable, which is necessary to create reports that graph usage over time.
  • The report should include the total size of data managed under Keep (see the sketch after this list):
    • total size of all blocks stored across all keep servers
    • deduplicated total (total size of all distinct blocks)
    • The per-user/per-group data size reporting should break down this way as well
  • It doesn't correlate the set of blocks listed in the collections owned by a user with the set of blocks actually stored in Keep as reported by keepstore. In my setup, I have collections records copied from the public beta server which total to 7 TiB, but I don't actually have 7 TiB of data in my development Keep instance. This is important because in the future we intend to have federated storage, where blocks may be stored offsite -- blocks which are not actually stored on the cluster shouldn't be counted against the user's disk usage.
  • If datamanager starts and immediately fails (such as a missing "-data-manager-token-file" command) it still writes a "data-manager-final" report which is a little confusing (this is why I suggested a distinct "error" event type).
  • Why is "Uuid to Size used" logged to the console in GetCollectionsAndSummarize()? It is already written out to the final report and doesn't seem useful in a "tell me what the service is doing" kind of way (maybe useful for debugging?)
  • Please merge with master
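
A minimal sketch of how those two totals could be computed, assuming a map from block locator to size and a map from locator to observed replica count (names are illustrative, not the actual report structures):

    // keepTotals returns the raw bytes stored across all keep servers and the
    // deduplicated total (each distinct block counted once).
    func keepTotals(blockSize map[string]int64, replicas map[string]int) (stored, deduplicated int64) {
        for locator, size := range blockSize {
            deduplicated += size                      // each distinct block once
            stored += size * int64(replicas[locator]) // every stored copy
        }
        return stored, deduplicated
    }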

I do want to move this along towards merging, so let me know if you feel like any of these comments are out of scope / too much work to hold up the merge.

Actions #24

Updated by Misha Zatsman about 9 years ago

Peter Amstutz wrote:

  • The Arvados convention for timestamps is to end in "_at" so runInfo["started_at"] instead of runInfo["time_start"] (and "finished_at" instead of "time_finished")

ok, changed. Where are these conventions recorded so next time I can check them myself?

  • In logger.go NewLogger(), you should be able to initialize more of the fields in the struct constructor:
    [...]

Done, thanks!

  • In logger.go work() you can test if the channel is closed using the 2nd return value "valid" to stop the goroutine:
    [...]

Writing to a closed channel will cause a panic. How do I prevent other goroutines from calling logger.Update() after I've closed the channel? Reintroduce the lock that we replaced with channels?
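
One common pattern (a sketch only, with hypothetical field names) is to never close the channel that producers write to, and instead signal shutdown with a separate done channel that the writer goroutine closes when it exits:

    type Logger struct {
        workToDo chan LogMutator // never closed by producers
        done     chan struct{}   // closed by the writer goroutine on shutdown
    }

    func (l *Logger) Update(mutator LogMutator) {
        select {
        case l.workToDo <- mutator:
        case <-l.done:
            // Writer has already shut down; drop (or report) the late update
            // instead of panicking on a send to a closed channel.
        }
    }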

  • Having write() block the logger work loop is not that big a deal as long as the mutator queue is long enough to soak up any pending changes and not block the data producer thread

ok

  • As I suggested previously, Keep services should be reported based on their service table uuid, with the host/port as secondary fields. Host and port may change based on cluster configuration but the UUID is stable, which is necessary to create reports that graph usage over time.

Changed, thanks for the reminder.

  • The report should include the total size of data managed under Keep:
    • total size of all blocks stored across all keep servers
    • deduplicated total (total size of all distinct blocks)
    • The per-user/per-group data size reporting should break down this way as well
  • It doesn't correlate the set of blocks listed in the collections owned by a user with the set of blocks actually stored in Keep as reported by keepstore. In my setup, I have collections records copied from the public beta server which total to 7 TiB, but I don't actually have 7 TiB of data in my development Keep instance. This is important because in the future we intend to have federated storage, where blocks may be stored offsite -- blocks which are not actually stored on the cluster shouldn't be counted against the user's disk usage.

These are great feature requests. Feel free to add them to https://arvados.org/projects/arvados/wiki/Data_Manager_Design_Doc#Additional-Reports

  • If datamanager starts and immediately fails (such as a missing "-data-manager-token-file" command) it still writes a "data-manager-final" report which is a little confusing (this is why I suggested a distinct "error" event type).

Yup, I proposed this over email and explicitly called out the fact that the final event type is used regardless of an error or a regular completion. You wrote back and said it sounds good.

  • Why is "Uuid to Size used" logged to the console in GetCollectionsAndSummarize()? It is already written out to the final report and doesn't seem useful in a "tell me what the service is doing" kind of way (maybe useful for debugging?)

Useful for debugging, just to see that things are happening.

  • Please merge with master

Done, after much git wrangling with Tom.

I do want to move this along towards merging, so let me know if you feel like any of these comments are out of scope / too much work to hold up the merge.

Done, thanks!

Actions #25

Updated by Ward Vandewege about 9 years ago

  • Target version changed from 2015-02-18 sprint to 2015-03-11 sprint
Actions #26

Updated by Ward Vandewege about 9 years ago

  • Story points changed from 5.0 to 3.0
Actions #27

Updated by Ward Vandewege about 9 years ago

  • Target version changed from 2015-03-11 sprint to 2015-04-01 sprint
Actions #28

Updated by Ward Vandewege about 9 years ago

  • Target version changed from 2015-04-01 sprint to 2015-04-29 sprint
Actions #29

Updated by Ward Vandewege almost 9 years ago

  • Target version changed from 2015-04-29 sprint to 2015-05-20 sprint
Actions #30

Updated by Brett Smith almost 9 years ago

  • Target version changed from 2015-05-20 sprint to 2015-06-10 sprint
Actions #31

Updated by Peter Amstutz almost 9 years ago

Reviewing 5824ee2

  • "DefaultReplicationLevel" should come from the "defaultCollectionReplication" key in the discovery document (GET /discovery/v1/apis/arvados/v1/rest from the API server, we should probably add a SDK method to arvadosclient for convenience) instead of being a constant.
  • Instead of "MaybeReadData" and "MaybeWriteData" in singlerun(), perhaps have a "var GetKeepData" which is a pointer to a function which is set to implement the appropriate behavior for testing/development/production?
  • Data manager should be aware of the new read-only flag on the keep_services table (implemented last sprint in #4717) and take that into account when building pull lists in ComputePullServers (see kc.WritableLocalRoots() and the sketch after this list).
  • Also in ComputePullServers, if numCopies == 0 that should be logged since that's potentially a serious problem (since that means data is missing.)
  • If a block can't be replicated sufficiently (not enough writable servers) that should also be logged.
  • Please put "WritePullLists" behind a flag, so it is optionally enabled (obviously it's only intended for development/testing, but it's possible it might be useful for production logging.)

Those are my high-level comments so far; I'll continue looking at it tomorrow.

Actions #32

Updated by Peter Amstutz almost 9 years ago

More comments.
In BuildPullLists, I think the assignment spl[destination] = pullList inside the if !pullListExists is redundant.

            pullList, pullListExists := spl[destination]
            if !pullListExists {
                pullList = PullList{}
                spl[destination] = pullList
            }
            pullList = append(pullList,
                PullRequest{Locator: locator, Servers: pullServers.From})
            spl[destination] = pullList
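
Assuming PullList is a slice type (which the append implies), the whole block can collapse to a single append-and-store, since appending to a nil slice is fine in Go:

    spl[destination] = append(spl[destination],
        PullRequest{Locator: locator, Servers: pullServers.From})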

Looking at the keepstore code, I think it is expecting the url scheme (http:// or https://) in the server strings of the pull lists, however you're stripping that off (which is what the spec says). Either the spec/datamanager or keepstore will need to be fixed. More generally, we need to figure out how to set up some kind of integration test between them.

For the longer term design, if you're having heap size issues storing all the blocks, would it make sense to look at using an embedded database such as sqlite or https://github.com/boltdb/bolt ?

Actions #33

Updated by Peter Amstutz almost 9 years ago

I ran it without any arguments and here's what I got:

$ ./datamanager 
2015/05/22 11:11:21 50 collections read, 50 new in last batch, 2013-07-10T08:01:20Z latest modified date, 322 557 16113 avg,max,total manifest size
2015/05/22 11:11:21 Received keep services list: {ItemsAvailable:2 KeepServers:[petere1:25107 petere1:25108]}
2015/05/22 11:11:21 Got Server Addresses: {false [petere1:25107 petere1:25108] map[petere1:25107:0 petere1:25108:1] map[] map[] map[]}
2015/05/22 11:11:21 About to fetch keep server contents from http://petere1:25107/index
Usage of ./datamanager:
  -data-manager-token-file="": File with the API token we should use to contact keep servers.
  -heap-profile="": File to write the heap profiles to. Leave blank to skip profiling.
  -log-event-type-prefix="experimental-data-manager": Prefix to use in the event_type of our arvados log entries. Set to empty to turn off logging
  -log-frequency-seconds=20: How frequently we'll write log entries in seconds.
  -minutes-between-runs=0: How many minutes we wait betwen data manager runs. 0 means run once and exit.
  -read-data-from="": Avoid network i/o and read summary data from this file instead. Used for development only.
  -write-data-to="": Write summary of data received to this file. Used for development only.
2015/05/22 11:11:21 About to fetch keep server contents from http://petere1:25108/index
2015/05/22 11:11:21 Data Manager Token needed, but data manager token file not specified.

If command line arguments are required, it should fail up front instead of starting to do stuff and then failing later on.
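
A minimal sketch of the kind of up-front check being suggested (the flag name is taken from the usage output above; the surrounding code is hypothetical):

    package main

    import (
        "flag"
        "log"
    )

    var dataManagerTokenFile = flag.String("data-manager-token-file", "",
        "File with the API token we should use to contact keep servers.")

    func main() {
        flag.Parse()
        // Fail fast on required arguments before launching goroutines or
        // issuing any network requests.
        if *dataManagerTokenFile == "" {
            log.Fatal("Data Manager Token needed, but -data-manager-token-file not specified.")
        }
        // ... proceed with fetching collections and keep server indexes ...
    }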

Actions #34

Updated by Peter Amstutz almost 9 years ago

Also, API server provides microsecond timestamps, but it looks like datamanager is truncating that to whole seconds. Using the full precision should make it easier to avoid fetching collections redundantly as there should be fewer cases of collections having exactly the same timestamp.

Actions #35

Updated by Peter Amstutz almost 9 years ago

Okay, I was able to run data manager against my local development instance and get a report.

Scratch one of my prior comments, I see now that MaybeReadData and MaybeWriteData are already optional on the command line behind "-read-data-from" and "-write-data-to".

Actions #36

Updated by Misha Zatsman almost 9 years ago

Peter Amstutz wrote:

Reviewing 5824ee2

Thanks!

  • "DefaultReplicationLevel" should come from the "defaultCollectionReplication" key in the discovery document (GET /discovery/v1/apis/arvados/v1/rest from the API server, we should probably add a SDK method to arvadosclient for convenience) instead of being a constant.

Sounds great, I added it as a todo for when the SDK method is added.

  • Instead of "MaybeReadData" and "MaybeWriteData" in singlerun(), perhaps have a "var GetKeepData" which is a pointer to a function which is set to implement the appropriate behavior for testing/development/production?

Good call, lemme know what you think of my implementation. I had to use a builder, so it's a little clumsier than I thought it would be at first. datamanager.go and file.go are the two files affected by this change.

  • Data manager should be aware of the new read only flag on the keep_services table (implemented last sprint in #4717) and take that into account when building pull lists in ComputePullServers. (see kc.WritableLocalRoots())

Thanks, I'm using WriteableLocalRoots now. This required changing the logic (and signature) of CreatePullServers somewhat.

  • Also in ComputePullServers, if numCopies == 0 that should be logged since that's potentially a serious problem (since that means data is missing.)

Missing data is actually detected upstream in SummarizeBuckets in summary.go

A numCopies of 0 here would indicate a logic error, since we're iterating over blocks which exist in keep, but at replication levels lower than the collections require.

FYI on qr1hi, as of mid March, there were 96978 blocks which are specified in collections on the api server but do not appear in keep. So this is a large amount of information to log. If we just log the collections themselves, it becomes much smaller, but it's still 689 collections.

  • If a block can't be replicated sufficiently (not enough writable servers) that should also be logged.

Where do you want this logged? In the log entry that we write to the API server? Or to the console?

  • Please put "WritePullLists" behind a flag, so it is optionally enabled (obviously it's only intended for development/testing, but it's possible it might be useful for production logging.)

That is how I wrote it.

In BuildPullLists, I think the assignment spl[destination] = pullList inside the if !pullListExists is redundant.

Fixed, thanks!

Looking at the keepstore code, I think it is expecting the url scheme (http:// or https://) in the server strings of the pull lists, however you're stripping that off (which is what the spec says). Either the spec/datamanager or keepstore will need to be fixed. More generally, we need to figure out how to set up some kind of integration test between them.

Cool, I'll leave it with the spec version, thanks for noticing that. An integration test sounds like a good long-term goal, but I hope in the future we'll make it easier to isolate integration tests from unit tests, so that developers can run just the unit tests if they want.

For the longer term design, if you're having heap size issues storing all the blocks, would it make sense to look at using an embedded database such as sqlite or https://github.com/boltdb/bolt ?

I thought I was having heap size issues, but it turns out the problem was actually on the api server.

If command line arguments are required, it should fail up front instead of starting to do stuff and then failing later on.

I just ran it and it failed after 135 ms. That seems fast enough to me.

Flags are defined in the files that use them; there is no global view of which flags are needed because it is a complex question. For example, if you are reading summary data from a file, you don't need a datamanager token.

I like the ease of maintenance of the current setup and I think it outweighs the cost of having a few extra lines for the reader to read. We'll never be able to pre-check all conditions (token file specified, token file exists, token file readable by user, etc.), so we have to accept that sometimes we have to discover these failures while running. Trying to explicitly list what flag combinations we need runs the risk of having those rules grow outdated versus our actual needs.

Also, API server provides microsecond timestamps, but it looks like datamanager is truncating that to whole seconds. Using the full precision should make it easier to avoid fetching collections redundantly as there should be fewer cases of collections having exactly the same timestamp.

Where did it look like we were truncating to whole seconds? I think what you're seeing is that the data is stored that way in the api server.

#6076 sets out to fix the problem on the api server.

Actions #37

Updated by Peter Amstutz almost 9 years ago

Misha Zatsman wrote:

Sounds great, I added it as a todo for when the SDK method is added.

Can you review the branch 6235-go-sdk-discovery ? This adds the relevant SDK method.

Good call, lemme know what you think of my implementation. I had to use a builder, so it's a little clumsier than I thought it would be at first. datamanager.go and file.go are the two files affected by this change.

Looks good, thanks.

  • Also in ComputePullServers, if numCopies == 0 that should be logged since that's potentially a serious problem (since that means data is missing.)

Missing data is actually detected upstream in SummarizeBuckets in summary.go

I see now. SummarizeBuckets categorizes the blocks into various BlockSets.

I'm confused by readCollections.BlockToReplication, is this the actual replication level, or the desired replication level? Consider tweaking the variable name to blockToDesiredReplication. Also maybe add a comment that the under replicated block list does not include blocks with 0 replicas.

FYI on qr1hi, as of mid March, there were 96978 blocks which are specified in collections on the api server but do not appear in keep. So this is a large amount of information to log. If we just log the collections themselves, it becomes much smaller, but it's still 689 collections.

  • If a block can't be replicated sufficiently (not enough writable servers) that should also be logged.

Where do you want this logged? In the log entry that we write to the API server? Or to the console?

So, we need to be able to get at the above information. Spamming the console log may not be the best way, but it does need to be provided because it should be a red flag for an admin that something is amiss.

Looking at the keepstore code, I think it is expecting the url scheme (http:// or https://) in the server strings of the pull lists, however you're stripping that off (which is what the spec says). Either the spec/datamanager or keepstore will need to be fixed. More generally, we need to figure out how to set up some kind of integration test between them.

Cool, I'll leave it with the spec version, thanks for noticing that. Integration test sounds like a good longterm goal, but I hope in the future we'll make it easier to isolate integration tests from unit tests, so that developers can run just the unit tests if they want.

Tom and I agreed that it should include "http://" or "https://". The spec has been updated, so please update DataManager.

Actions #38

Updated by Misha Zatsman almost 9 years ago

Peter Amstutz wrote:

Misha Zatsman wrote:

Sounds great, I added it as a todo for when the SDK method is added.

Can you review the branch 6235-go-sdk-discovery ? This adds the relevant SDK method.

Reviewed, I just had one question.

Actions #39

Updated by Peter Amstutz almost 9 years ago

Misha Zatsman wrote:

Peter Amstutz wrote:

Misha Zatsman wrote:

Sounds great, I added it as a todo for when the SDK method is added.

Can you review the branch 6235-go-sdk-discovery ? This adds the relevant SDK method.

Reviewed, I just had one question.

Responded on #6235

Also reviewed 3408-add-size-to-locators. Go ahead and merge that with 3408-production-datamanager. Let me know when you've responded to my other review comments in note 37. Finally, I'll talk to Brett about adding a story for next sprint to set up some kind of integration test between data manager, the API server and the keep servers.

Actions #40

Updated by Brett Smith almost 9 years ago

  • Target version changed from 2015-06-10 sprint to 2015-07-08 sprint
Actions #41

Updated by Misha Zatsman almost 9 years ago

Peter Amstutz wrote:

Can you review the branch 6235-go-sdk-discovery ? This adds the relevant SDK method.

I was surprised to find that this value is type float64, I added a comment to #6235.

I'm confused by readCollections.BlockToReplication, is this the actual replication level, or the desired replication level? Consider tweaking the variable name to blockToDesiredReplication. Also maybe add a comment that the under replicated block list does not include blocks with 0 replicas.

Renamed and added a note.

FYI on qr1hi, as of mid March, there were 96978 blocks which are specified in collections on the api server but do not appear in keep. So this is a large amount of information to log. If we just log the collections themselves, it becomes much smaller, but it's still 689 collections.

  • If a block can't be replicated sufficiently (not enough writable servers) that should also be logged.

Where do you want this logged? In the log entry that we write to the API server? Or to the console?

So, we need to be able to get at the above information. Spamming the console log may not be the best way, but it does need to be provided because it should be a red flag for an admin that something is amiss.

Totally agree it's a red flag. The counts appear in the log entry under summary_info, so we can monitor those values for any kind of alerting we want to do.

Looking at the keepstore code, I think it is expecting the url scheme (http:// or https://) in the server strings of the pull lists, however you're stripping that off (which is what the spec says). Either the spec/datamanager or keepstore will need to be fixed. More generally, we need to figure out how to set up some kind of integration test between them.

Cool, I'll leave it with the spec version, thanks for noticing that. Integration test sounds like a good longterm goal, but I hope in the future we'll make it easier to isolate integration tests from unit tests, so that developers can run just the unit tests if they want.

Tom and I agreed that it should include "http://" or "https://". The spec has been updated, so please update DataManager.

Done. This wasted effort of rewriting could have been avoided if whoever decided to change the implementation had also updated the spec. Preventing waste is exactly why I took the time to write a spec in the first place.

Also reviewed 3408-add-size-to-locators. Go ahead and merge that with 3408-production-datamanager.

Merged.

I just ran the datamanager a bunch of times and it seems to be reliably working. It finishes super fast too (under ten minutes), which is great after a year of watching it take forever.

Let me know what you think, I believe this is ready to merge.

Actions #42

Updated by Peter Amstutz almost 9 years ago

Misha Zatsman wrote:

Peter Amstutz wrote:

Can you review the branch 6235-go-sdk-discovery ? This adds the relevant SDK method.

I was surprised to find that this value is type float64, I added a comment to #6235.

Responded on #6235

I'm confused by readCollections.BlockToReplication, is this the actual replication level, or the desired replication level? Consider tweaking the variable name to blockToDesiredReplication. Also maybe add a comment that the under replicated block list does not include blocks with 0 replicas.

Renamed and added a note.

Thanks.

Totally agree it's a red flag. The counts appear in the log entry under summary_info, so we can monitor those values for any kind of alerting we want to do.

Ok. The team has been discussing the need for a real monitoring/alerting infrastructure so this can go on the list of things to monitor.

Done. This wasted effort of rewriting could have been avoided if whoever decided to change the implementation had also updated the spec. Preventing waste is exactly why I took the time to write a spec in the first place.

Well, I agree that ideally the question of "how does the keep service know which protocol to use to contact another keep service" should have come up when the spec was reviewed, but that was an oversight.

I just ran the datamanager a bunch of times and it seems to be reliably working. It finishes super fast too (under ten minutes), which is great after a year of watching it take forever.

That's very good to hear; several significant performance fixes for handling collections on the API server were merged in the last sprint, so it is very good news that this translates into better performance for datamanager.

Let me know what you think, I believe this is ready to merge.

Er, tell me if I'm just overlooking it, but this appears to not yet have the actual code to POST the pull list to each keep service?

Possibly we could do that in #6260 (data manager + keepstore integration test) since the purpose of the integration test is to validate that the data manager and keepstore can actually work together to execute the pull list behavior successfully.

I will talk to Brett tomorrow to decide how we want to proceed.

Actions #43

Updated by Peter Amstutz almost 9 years ago

Please go ahead and merge to master.

Actions #44

Updated by Brett Smith almost 9 years ago

  • Assigned To changed from Misha Zatsman to Peter Amstutz

Peter, please sync up with me about the current state of this branch, and let's figure out next steps for it.

Actions #45

Updated by Brett Smith almost 9 years ago

  • Target version changed from 2015-07-08 sprint to 2015-07-22 sprint
Actions #46

Updated by Brett Smith almost 9 years ago

  • Story points changed from 3.0 to 1.0
Actions #47

Updated by Brett Smith almost 9 years ago

  • Status changed from In Progress to Rejected

To spell out a little better what should happen with the issue this sprint:

  • Does the outstanding Data Manager branch have code to generate trash lists? If not, let's assess what it would take to add that, and consider doing it if it can be done in an ideal development day.
  • If it gets to that point this sprint, Peter should take point on actually merging the branch to master.
Actions #48

Updated by Brett Smith almost 9 years ago

  • Status changed from Rejected to In Progress
Actions #49

Updated by Peter Amstutz almost 9 years ago

  • Assigned To changed from Peter Amstutz to Paolo Di Tommaso
  • Status changed from In Progress to Resolved

Marking as resolved because the last push from Misha is merged and further development will occur through new stories instead of continuing this epic story.

Actions #50

Updated by Peter Amstutz over 8 years ago

  • Assigned To changed from Paolo Di Tommaso to Peter Amstutz