Idea #11876

Updated by Tom Morris about 7 years ago

Overview: 

 As an R programmer I'd like to have the ability to query the Arvados APIs directly from R using a package which integrates well with and is published with the rest of the Bioconductor packages. The SDK should ideally allow me to do everything a Python programmer can do using the Python SDK. 

 As a first step, the R SDK should allow me to find collections and files in Keep by filtering on metadata, load the files into R, process them, and then write the results back to a collection. 
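As a rough illustration of "filtering on metadata", the sketch below builds the kind of filter list the Arvados API takes as its `filters` parameter (a JSON array of `[attribute, operator, operand]` triples). The helper names `arv_filter` and `encode_filters` are hypothetical and are not part of any existing SDK; `jsonlite` is only needed when the filters are actually encoded.

```r
# Hypothetical helper: one filter triple, [attribute, operator, operand]
arv_filter <- function(attribute, operator, operand) {
  list(attribute, operator, operand)
}

# Hypothetical helper: encode a list of filters as the JSON string
# that would be sent as the "filters" query parameter
encode_filters <- function(filters) {
  jsonlite::toJSON(filters, auto_unbox = TRUE)
}

# Example: collections whose name contains "fastq", owned by one project.
# The UUID below is a made-up placeholder.
filters <- list(
  arv_filter("name", "like", "%fastq%"),
  arv_filter("owner_uuid", "=", "zzzzz-j7d0g-0123456789abcde")
)
```

Calling `encode_filters(filters)` would give a string that could be passed as the `filters` query parameter of a collections list request.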

 As an optional second stage, it'd be useful to be able to submit CWL jobs and monitor their progress. 

 The SDK should work on Windows, OS X, and Linux, which implies that depending on arv-mount for file reading and writing is not an acceptable option. Instead, we will use the WebDAV support in keep-web. Read-only support is already available (completed in issue #12216). Write support is forthcoming; see issue #12483. 
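A minimal sketch of what reading a file through keep-web might look like from R, without arv-mount. The host name, the `/c=<uuid>/<path>` URL layout, the `Bearer` authorization scheme, and the function names `keep_web_url` and `read_keep_csv` are all assumptions made for illustration and should be checked against the keep-web documentation.

```r
# Build a keep-web URL for a file inside a collection.
# ASSUMPTION: the "/c=<uuid>/<path>" layout and the host are illustrative.
keep_web_url <- function(host, collection_uuid, path) {
  # Encode each path segment separately so the "/" separators survive
  segments <- strsplit(path, "/", fixed = TRUE)[[1]]
  segments <- segments[nzchar(segments)]
  encoded <- vapply(segments, utils::URLencode, character(1), reserved = TRUE)
  paste0("https://", host, "/c=", collection_uuid, "/",
         paste(encoded, collapse = "/"))
}

# Hypothetical reader: fetch a CSV over HTTP(S) and parse it into a data frame.
# ASSUMPTION: keep-web accepts the API token as a Bearer token.
read_keep_csv <- function(url, token) {
  resp <- httr::GET(url, httr::add_headers(Authorization = paste("Bearer", token)))
  httr::stop_for_status(resp)
  read.csv(text = httr::content(resp, as = "text", encoding = "UTF-8"))
}
```

Because this is plain HTTP, the same code path should work on Windows, OS X, and Linux.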

 More detail: 

 There are two relevant packages, SevenBridges "sevenbridges" and Illumina's "BaseSpaceR", which could be used for a) design ideas and b) starting points for implementation (they are both Apache licensed). 

 http://bioconductor.org/packages/release/bioc/html/sevenbridges.html 
 https://github.com/sbg/sevenbridges-r 
 http://bioconductor.org/packages/release/bioc/html/BaseSpaceR.html 
 https://developer.basespace.illumina.com/docs/content/documentation/sdk-samples/r-sdk-overview 

 A potential supporting component might be googleAuthR http://code.markedmondson.me/googleAuthR/ which could be used in a similar way to googleComputeEngineR https://cloudyr.github.io/googleComputeEngineR/ and other packages which are layered on it. googleAuthR can be used for API generation and response parsing, but needs to be reworked to not assume Google authentication or endpoints. Instead of the OAuth2 dance, it needs to be able to use an API token. 
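To make the "API token instead of the OAuth2 dance" point concrete, here is a small sketch of token-based configuration with a caller-supplied host. `ARVADOS_API_HOST` and `ARVADOS_API_TOKEN` are the usual Arvados environment variables; the function names, the `/arvados/v1/` base path, and the `Bearer` scheme are assumptions for illustration, not the generated package's actual interface.

```r
# Hypothetical configuration object: host and token come from the caller
# (or the environment) instead of being hardwired or obtained via OAuth2.
arvados_config <- function(host = Sys.getenv("ARVADOS_API_HOST"),
                           token = Sys.getenv("ARVADOS_API_TOKEN")) {
  if (!nzchar(host) || !nzchar(token)) {
    stop("Arvados API host and token must both be set")
  }
  list(
    # ASSUMPTION: REST resources live under /arvados/v1/ on the API host
    base_url = paste0("https://", host, "/arvados/v1/"),
    # ASSUMPTION: the token is sent as a Bearer token
    headers = c(Authorization = paste("Bearer", token))
  )
}

# Hypothetical GET wrapper that a generated function could delegate to
arvados_get <- function(config, resource, query = list()) {
  resp <- httr::GET(paste0(config$base_url, resource),
                    httr::add_headers(.headers = config$headers),
                    query = query)
  httr::stop_for_status(resp)
  jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))
}
```

With something like this in place, a call such as `arvados_get(arvados_config(), "collections")` would need no Google authentication step.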

 This code snippet will generate an entire R stub package using googleAuthR: 

 <pre> 
 library('googleAuthR') 
 # Fetch the Arvados discovery document 
 url <- "https://qr1hi.arvadosapi.com/discovery/v1/apis/arvados/v1/rest" 
 req <- httr::RETRY("GET", url) 
 httr::stop_for_status(req) 
 content <- httr::content(req, as = "text") 
 api_description <- jsonlite::fromJSON(content) 
 message("Loaded the full Arvados API description ", api_description$name, " ", api_description$version) 

 message("Generating API skeleton") 
 gar_create_api_objects(filename = "arvados_objects.R", api_json = api_description) 
 gar_create_api_skeleton('arvados_functions.R', api_description, format = TRUE) 
 message("API generation complete") 

 # Make sure we can load our newly generated code 
 source('arvados_functions.R') 
 source('arvados_objects.R') 

 # Generate the whole package at once 
 gar_create_package(api_description, '/tmp/aRv', rstudio = TRUE, check = TRUE, github = FALSE) 
 </pre> 

 The gar_create_package call does the whole thing including man pages, README, etc., but gar_create_api_objects and gar_create_api_skeleton can be used to do just a part of the process. 

 The generator assumes the context of a Google API, so it has a bunch of built-in assumptions that need to be cleaned up. Below is a non-exhaustive list: 
 * authentication - switch from Google auth to Arvados token-based authentication, remove/fix all references to googleAuthR::gar_auth() and Google API scopes 
 * fixed base URL - in the above example qr1hi.arvadosapi.com is hardwired into the API. This needs to be configurable by the caller. 
 * man page generation - there's a bunch of warnings due to formatting in the docs 
 * Bioconductor packaging, types, conventions - the core generator targets CRAN. This may need to be extended for Bioconductor 
 * LICENSE & AUTHOR - these are wrong; need to figure out where their contents come from 

 It is desirable that changes to the code generator be done in such a way that they can be adopted by the upstream project as parameterizable options, but it's not mandatory. 

 There are also additional things which need to be added: 
 * tests 
 * vignettes/examples 

 Some hints on testing and other advanced topics are here: 
 http://code.markedmondson.me/googleAuthR/articles/advanced-building.html 

 The API currently consists of 24 object types (plus 24 list types for those objects) and 223 methods. There are create, delete, destroy, get, list, show, and update methods for each object type, and then another 26 methods which are used once or twice each. 

 Test comparisons: Python - 8K lines, 472 tests; Golang - 4K lines, 213 tests 

 Object Types to be supported in initial version: 

 Collection  
 Container  
 ContainerRequest  
 Group (user groups & projects) 
 Link  
 Log ? 
 User ? 
 Workflow  

 Object Types not needed initially: 

 ApiClient  
 ApiClientAuthorization  
 AuthorizedKey  
 Human  
 Job  
 JobTask  
 KeepDisk  
 KeepService  
 Log  
 Node  
 PipelineInstance  
 PipelineTemplate  
 Repository  
 Specimen  
 Trait  
 UserAgreement  
 VirtualMachine  

 Miscellaneous Arvados API methods: 

 <pre> 
    1 accessible  
    1 activate  
    1 auth  
    2 cancel  
    1 contents  
    1 create_system_auth  
    3 current  
    1 get_all_logins  
    1 get_all_permissions  
    1 get_permissions  
    2 lock  
    1 logins  
    1 new  
    2 ping  
    1 provenance  
    1 queue  
    1 queue_size  
    1 setup   
    1 sign  
    1 signatures  
    1 system  
    1 trash  
    1 unlock  
    1 unsetup  
    1 untrash  
    1 used_by  
 </pre>
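A tally like the one above could be reproduced from the discovery document with something along these lines. This is a sketch that assumes the parsed document has the usual discovery layout (`resources`, each with a named `methods` list); the function names are hypothetical.

```r
# Method names that exist for every object type
crud_methods <- c("create", "delete", "destroy", "get", "list", "show", "update")

# Count how many resources define each method name.
# ASSUMPTION: api_description$resources[[name]]$methods is a named list.
tally_method_names <- function(api_description) {
  method_names <- unlist(
    lapply(api_description$resources, function(res) names(res$methods)),
    use.names = FALSE
  )
  tab <- table(method_names)
  stats::setNames(as.integer(tab), names(tab))
}

# Keep only the methods that are not part of the common per-object set
miscellaneous_methods <- function(api_description) {
  counts <- tally_method_names(api_description)
  counts[setdiff(names(counts), crud_methods)]
}

# Toy stand-in for a parsed discovery document (not real Arvados data)
toy_api <- list(resources = list(
  collections = list(methods = list(get = list(), list = list(), provenance = list())),
  users       = list(methods = list(get = list(), list = list(), current = list()))
))
misc <- miscellaneous_methods(toy_api)
```

Running `miscellaneous_methods(api_description)` on the real document from the generator snippet should give the per-name counts directly.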
