Feature #12018

Synchronize group membership with external data source

Added by Tom Morris 3 months ago. Updated about 1 hour ago.

Status:In ProgressStart date:07/21/2017
Priority:NormalDue date:
Assignee:Lucas Di Pentima% Done:


Target version:2017-10-25 Sprint
Story points2.0Remaining (hours)0.00 hour
Velocity based estimate0 days


As a user in a corporate environment, I want to be able to synchronize the users in my Arvados groups with my corporate directory service (ActiveDirectory, LDAP, etc).

This doesn't need to be instantaneous, but can instead by done either periodically on a scheduled based or on demand. A script-based solution is an acceptable answer.

Groups which get created by this mechanism get tagged so that they're known to be automatically created. Groups are not given any particular permissions when they are created.

Input is a two column CSV file with a column of Group name and one column of user IDs (either username or user email address) with a command flag which controls whether the user ID is username or email address. If a user is no longer included in the input file, they get removed from the group membership.

Workbench needs to be changed to not allow admins to modify group membership for synched.

Tool should report errors for any users who don't have matching user IDs. Groups which don't exist get created and their UUIDs get reported. If an untagged group exists and is also in the input file, a warning is issued.


Task #12264: Review 12018-sync-groups-tool In ProgressTom Clegg


#1 Updated by Tom Morris 2 months ago

  • Description updated (diff)
  • Target version changed from Arvados Future Sprints to 2017-08-30 Sprint
  • Story points set to 2.0

#2 Updated by Tom Morris 2 months ago

  • Target version changed from 2017-08-30 Sprint to 2017-09-13 Sprint

#3 Updated by Tom Morris about 1 month ago

  • Target version changed from 2017-09-13 Sprint to 2017-09-27 Sprint

#4 Updated by Lucas Di Pentima about 1 month ago

  • Assignee set to Lucas Di Pentima

#5 Updated by Lucas Di Pentima 23 days ago

Updates at fd14dc21b - branch 12018-sync-groups-tool (WIP)

Added a first version of the new command arv-sync-groups, it currently does most of the described requirements, it's pending the reporting of untagged existing groups, and created group uuids.

About the first missing feature, I have a doubt: Should these groups be created as "system groups"? (ie: owner_uuid == {prefix}-tpzed-0000000000000). I ask because there's a unique index on groups that's composed of (owner_uuid, group_name) so to check for untagged but existing groups, I suppose I have to retrieve all groups from a particular owner and avoid name collision with other user's groups.

Related to the above, here's another question: Should this command use credentials from the environment vars as arv-put and others, or as it's more an admin tool, should read auth credentials from some config file?

#6 Updated by Lucas Di Pentima 23 days ago

  • Status changed from New to In Progress

#7 Updated by Tom Clegg 22 days ago

How about making a single parent group, and using that as the owner_uuid of all synchronized groups?
  • If --parent-group-uuid arg is given, use that uuid as the parent. Log a warning if it's not owned by system_user.
  • Otherwise, create/find a group (owned by system-user) named "Externally synchronized groups", and use that as the parent

For credentials, just let the SDK do its default thing (env vars, then ~/.config/arvados/settings.conf).

#8 Updated by Lucas Di Pentima 22 days ago

  • Target version changed from 2017-09-27 Sprint to 2017-10-11 Sprint

#9 Updated by Lucas Di Pentima 8 days ago

  • Target version changed from 2017-10-11 Sprint to 2017-10-25 Sprint

#10 Updated by Lucas Di Pentima 7 days ago

First golang version at 46141b6c9098f30dcd6644845887789c1c9006da

It's basically a direct translation of the python version, I feel that I used too many type assertions, making the code "ugly", difficult to read. I suppose I should be creating specific types, for example a Set type instead of using map[string]struct{}

I was thinking about adding ListAll() to the go SDK, as it exist on PySDK arvados.util package. Do you think it would be useful?

I would love some feedback on golang code styling, any tips will be welcome. Thanks!

#11 Updated by Tom Clegg 7 days ago

You're right about all the type shenanigans. That map[string]arvadosclient.Dict stuff in the arvadosclient module wasn't the best idea. We can do something more like source:services/keep-balance/collection.go now -- we'll have to add Group and GroupList types in sdk/go/arvados (source:sdk/go/arvados/log.go is probably a good one to copy & modify) but once that's done, you can just load json right into a struct and use its fields like real fields, instead of all the string-keys and casting stuff.

SDK aside, you definitely shouldn't use arvadosclient.Dict for things like remoteGroups. Make a struct type with the fields you need to track, maybe something like

type groupInfo struct {
        Group           arvados.Group
        PreviousMembers map[string]bool
        CurrentMembers  map[string]bool

You can say fmt.Sprintf("%s", err), you don't need to say err.Error()

Golang style says error strings should start lowercase -- otherwise you end up chaining together into "Can't do this: Trouble at the mill: Shot off, completely."

I wouldn't bother trying to os.Stat() the input file to predict whether opening will work. Just open it, and if that doesn't work, report the error.

Instead of ("error reading csv file: %s", err) ... how about ("error reading %q: %s", *srcPath, err)

Appending to (and iterating over) a nil slice works, so you can probably drop some of the stuff like make([]interface{}, 0) -- just "var links []interface{}" would work here.

The "contains" function is unnecessary. You can do this in subtract():

if _, found := setB[element]; !found {
        result[element] = struct{}{}

Another way would be to use a map[string]bool instead of a map[string]struct{}, always assign true when you're assigning, and do this

if !setB[element] {
        result[element] = struct{}{}

#12 Updated by Lucas Di Pentima 1 day ago

Updates at ed6af9cb4

This commit is about tidying up the code, following the suggestions on note-11:

  • Removed initial Python version.
  • Removed superfluous input files checks and moved file opening to be before any API calls, to avoid doing them is there's a problem with the file.
  • Added user, group & link types so they're populated by json decoding.
  • Cleaned up ListAll() func so it can work with different resource types.
  • Changed usage of set style types from map[string]struct{} to be map[string]bool to simplify membership checking.
  • Corrected error messages to start with lowercase.
  • Added more debug messages for -verbose mode.

#13 Updated by Lucas Di Pentima about 8 hours ago

Updates at ea10340803abade2d35212866fcbc1beb1acd533

  • Added -parent-group-uuid parameter to specify a parent group for the remote groups. This group should be owned by the system user. If not provided, the tool will search for a group named "Externally synchronized groups" owned by the system user, or create one if not found.
  • Skip (with a warning message) CSV lines with an empty field
  • Check if current user is admin, fail otherwise.
  • Use the parent group uuid when searching/creating remote groups.

Tests are still pending.

#14 Updated by Lucas Di Pentima about 7 hours ago

Found a bug, working on it.

#15 Updated by Tom Clegg about 1 hour ago


In the "evict" code, it seems like we should be removing user→group and group→user links, and we shouldn't be filtering on name=can_read

Arvados stuff

Should use arvados.ResourceListParams instead of arvadosclient.Dict for limit, offset, etc.

Should use arvados.User instead of a custom user type -- just need to add the Email field to source:sdk/go/arvados/user.go, and a UserList type similar to CollectionList in source:sdk/go/arvados/collection.go

I suspect the only resourceList interface you really need to implement ListAll is "Len() int"

In ListAll, simplify by always using limit=2^31-1 and terminate when len(response.Items)==0

Paging code should use order=uuid instead of the default modified_at -- less likely to miss groups in races that way

Shouldn't we be using group properties instead of tag links for "tagged as remote"? We can't yet filter on properties using the API filters, but the expected usage is for all role groups to be synchronized, so it might be best to filter on group_class==null and do additional filtering by properties on the client side for now.

Go style

Don't split lines just to stay in 80 chars (e.g., "error creating tag for group %q: %s")

Also available in: Atom PDF