API Historical/Forecasting data for CR

Goal: create a forecasting pipeline and a visualization for historical data. It should expose APIs that can be used in the ContainerRequest visualization and
that could also provide extra information about a currently running CR.

Glossary:

  • Checkpoint: a generic name that currently corresponds to a step name. Together with the "family", this id identifies a unique cluster over which results are summarized. The summary for a cluster covers: a) several runs with similar parameters and b) scattered steps whose names follow the pattern name_2, name_3, ..., name_229.
  • Family: A common name like "gatk" or "haplotypecaller" can be used as a step name by different workflows. The family definition helps to separate the two populations in terms of checkpoints. We think this could be implemented based on the parameters of the CommandLineTool, the md5sum of the parent workflow, or a combination of both.
  • Datapoint: a concrete record that can be plotted as historical data. Currently we are binding together the container request and its associated container to get a unified view of the times involved. This should not be confused with forecast data, since the two can be used separately.
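
As an illustration of how scattered steps could be folded into one checkpoint, here is a minimal sketch. It assumes that a trailing "_<number>" is the only marker of a scattered step, and the function names (checkpoint_name, cluster_steps) are hypothetical, not part of any existing Arvados API.

```python
import re
from collections import defaultdict

# Assumption: a trailing "_<number>" marks a scattered instance of a step.
_SCATTER_SUFFIX = re.compile(r"_\d+$")


def checkpoint_name(step_name):
    """Strip a trailing scatter index, e.g. 'createsglf_57' -> 'createsglf'."""
    return _SCATTER_SUFFIX.sub("", step_name)


def cluster_steps(steps):
    """Group (family, step_name) pairs by their (family, checkpoint) cluster."""
    clusters = defaultdict(list)
    for family, step_name in steps:
        clusters[(family, checkpoint_name(step_name))].append(step_name)
    return dict(clusters)
```

A step whose real name ends in "_<number>" would be collapsed too, so a real implementation would probably need more information than the name alone.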

API

The "checkpoints" endpoint exposes the statistics that will be used for forecasting. For now, as an example, we start with the time_* keys; in the future it will expose all the data needed to make an accurate forecast.

GET /container-request/aaaaa-xvhdp-123456789abc/checkpoints

Output:

{
  "checkpoints": [
    {
      "name": "merge-tilelib",
      "family": "family22",
      "dependencies": [
        "createsglf" 
      ],
      "time_average": 8254.534873,
      "time_count": 1,
      "time_min": 8254.534873,
      "time_min_comment": "duration:merge-tilelib#su92l-dz642-cc7799yfwi5jmd9",
      "time_max": 8254.534873,
      "time_max_comment": "duration:merge-tilelib#su92l-dz642-cc7799yfwi5jmd9" 
    },
    {
      "name": "createsglf",
      "family": "family9",
      "dependencies": [],
      "time_average": 4741.290203,
      "time_count": 58,
      "time_min": 82.138309,
      "time_min_comment": "duration:createsglf_57#su92l-dz642-3u3g4bq1yh4pqje",
      "time_max": 5818.898387,
      "time_max_comment": "duration:createsglf_8#su92l-dz642-8d094xhqciin5m2" 
    },
    ...
  ],
  "time_average": <average time for the CR family>,

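The time_* keys in the checkpoints output can be derived from the individual durations of a cluster. The sketch below shows one way to do it; the function name and the input shape (a list of (seconds, step name, container UUID) tuples) are assumptions made for illustration, while the comment format follows the "duration:<step>#<container>" strings in the sample output.

```python
def summarize_checkpoint(name, family, dependencies, durations):
    """Build one checkpoint entry from the durations of its cluster.

    durations: list of (seconds, step_name, container_uuid) tuples
    (hypothetical input shape).
    """
    if not durations:
        raise ValueError("a checkpoint needs at least one duration")

    shortest = min(durations, key=lambda d: d[0])
    longest = max(durations, key=lambda d: d[0])

    def comment(d):
        # Same shape as the time_*_comment values in the sample output.
        return "duration:{}#{}".format(d[1], d[2])

    return {
        "name": name,
        "family": family,
        "dependencies": list(dependencies),
        "time_average": sum(d[0] for d in durations) / len(durations),
        "time_count": len(durations),
        "time_min": shortest[0],
        "time_min_comment": comment(shortest),
        "time_max": longest[0],
        "time_max_comment": comment(longest),
    }
```

With a single run, as in the merge-tilelib entry, the average, minimum and maximum all collapse to the same value.
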
GET /container-request/aaaaa-xvhdp-123456789abc/datapoints

Output:

[
  {
    "step_name": "createsglf",
    "start_1": "2020-01-15 19:49:34.213 +0000",
    "end_1": "2020-01-15 21:19:39.001 +0000",
    "start_2": "2020-01-15 19:54:44.864 +0000",
    "end_2": "2020-01-15 21:19:39.001 +0000",
    "reuse": false,
    "status": "completed",
    "legend": "<p>createsglf</p><p>Container Request: <a href=\"https://workbench.su92l.arvadosapi.com/container_requests/su92l-xvhdp-zfc3ffxk3slmkzv\">su92l-xvhdp-zfc3ffxk3slmkzv</a></p><p>Container duration: 1h24m54.137122s\n</p>" 
  },
  {
    "step_name": "createsglf_2",
    "start_1": "2020-01-15 19:49:34.288 +0000",
    "end_1": "2020-01-15 21:29:11.399 +0000",
    "start_2": "2020-01-15 19:54:51.275 +0000",
    "end_2": "2020-01-15 21:29:11.399 +0000",
    "reuse": false,
    "status": "completed",
    "legend": "<p>createsglf_2</p><p>Container Request: <a href=\"https://workbench.su92l.arvadosapi.com/container_requests/su92l-xvhdp-py99va9hnvuxzp5\">su92l-xvhdp-py99va9hnvuxzp5</a></p><p>Container duration: 1h34m20.123849s\n</p>" 
  },
....
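
A client could turn a datapoint into durations along these lines. This is a sketch under one assumption: start_1/end_1 describe the container request and start_2/end_2 the container. That reading is consistent with the "Container duration" shown in the legend, but it is not stated explicitly above. The function names are hypothetical.

```python
from datetime import datetime

# Timestamp layout used in the sample output, e.g. "2020-01-15 19:49:34.213 +0000".
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f %z"


def parse_time(value):
    return datetime.strptime(value, TIME_FORMAT)


def datapoint_times(datapoint):
    """Return the durations, in seconds, implied by one datapoint.

    Assumption: *_1 is the container request, *_2 is the container.
    """
    start_1 = parse_time(datapoint["start_1"])
    end_1 = parse_time(datapoint["end_1"])
    start_2 = parse_time(datapoint["start_2"])
    end_2 = parse_time(datapoint["end_2"])
    return {
        # Whole life of the container request.
        "total_seconds": (end_1 - start_1).total_seconds(),
        # Time between the request starting and the container starting.
        "wait_seconds": (start_2 - start_1).total_seconds(),
        # Time the container itself was running.
        "run_seconds": (end_2 - start_2).total_seconds(),
    }
```

For the first createsglf datapoint above, run_seconds comes out at about 5094.137, which matches the 1h24m54.137122s in its legend.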

GET /container-request/aaaaa-xvhdp-123456789abc/workflow-dot

Output:

digraph cwlgraph {
rankdir=LR;
graph [compound=true];

subgraph cluster_0 {
label="#createcgf-wf.cwl";
node [style=filled];
shape=box
style="filled";
color="#dddddd";
"#createcgf-wf.cwl" [ label = "#createcgf-wf.cwl", style = invis ];
....

Frontend

The dot file can be rendered with https://domparfitt.com/graphviz-react/. We have already tested it with some big files.

Schema and queries on the postgres DB

TODO: Outline the transformation from the current local leveldb cache to some per-user caching table.
TODO: list the queries to INSERT and SELECT the data for a particular checkpoint.

Permissions

One concern is permissions. We'll behave like everything else in Arvados: if the token doesn't have access to a CR, the response is a 404. This also applies to "summarized data" such as the historical times and prices of the CRs.

When forecasting a CR for a given user, we should only use data about containers that user can see. This has implications for caching:
  • When responding to user A, can't reuse cached results that we generated for user B
  • When using cached results, need to consider whether to recompute to reflect recent permission changes
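
One possible shape for such a cache is sketched below. It is only an illustration: the class name, the (user UUID, CR UUID) key and the age-based expiry are all assumptions. Expiring entries after a fixed age is just one way to bound how long a result can outlive a permission change; it is not a decision made in this document.

```python
import time


class ForecastCache:
    """Per-user cache: results are never shared between users."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries = {}

    def put(self, user_uuid, cr_uuid, value):
        # The user is part of the key, so user B never sees user A's entry.
        self._entries[(user_uuid, cr_uuid)] = (self._clock(), value)

    def get(self, user_uuid, cr_uuid):
        key = (user_uuid, cr_uuid)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._max_age:
            # Too old: force a recompute so permission changes are picked up.
            del self._entries[key]
            return None
        return value
```

A caller that gets None recomputes the forecast using only the containers the requesting user can currently see, then stores it with put().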

Real World Example

Take the case of su92l-xvhdp-bs4tseq26te2bnz (a hasher function that Ops usually uses as a smoke test).

graph

The dot representation would be:

digraph cwlgraph {
rankdir=LR;
graph [compound=true];

subgraph cluster_0 {
label="#main";
node [style=filled];
shape=box
style="filled";
color="#dddddd";
"#main" [ label = "#main", style = invis ];

"#main\ninputfile" -> "step #main\nhasher1";
"#main\nhasher1_outputname" -> "step #main\nhasher1";
"step #main\nhasher1" -> "#main\nhasher1\nhasher_out";
"#main\nhasher1\nhasher_out" -> "step #main\nhasher2";
"#main\nhasher2_outputname" -> "step #main\nhasher2";
"step #main\nhasher2" -> "#main\nhasher2\nhasher_out";
"#main\nhasher2\nhasher_out" -> "step #main\nhasher3";
"#main\nhasher3_outputname" -> "step #main\nhasher3";
"step #main\nhasher3" -> "#main\nhasher3\nhasher_out";
}

"step #main\nhasher1" [fillcolor="#FFD700", style="rounded,filled", shape=box];
"step #main\nhasher2" [fillcolor="#FFD700", style="rounded,filled", shape=box];
"step #main\nhasher3" [fillcolor="#FFD700", style="rounded,filled", shape=box];
"#hasher.cwl" [fillcolor="#FF9912", style="rounded,filled", shape=box];

"step #main\nhasher1" -> "#hasher.cwl" [label="runs", style="dashed"];
"step #main\nhasher2" -> "#hasher.cwl" [label="runs", style="dashed"];
"step #main\nhasher3" -> "#hasher.cwl" [label="runs", style="dashed"];
}

datapoints

[
  {
    "checkpoint": "hasher1",
    "start_1": "2020-05-12 16:35:33.594 +0000",
    "end_1": "2020-05-12 16:37:30.597 +0000",
    "start_2": "2020-05-12 16:37:27.893 +0000",
    "end_2": "2020-05-12 16:37:30.597 +0000",
    "reuse": false,
    "legend": "<p>hasher1</p><p>Container Request: <a href=\"https://workbench.su92l.arvadosapi.com/container_requests/su92l-xvhdp-pbpkli9qovdo4q8\">su92l-xvhdp-pbpkli9qovdo4q8</a></p><p>Container duration: 2.70491s\n</p>" 
  },
  {
    "checkpoint": "hasher2",
    "start_1": "2020-05-12 16:37:33.673 +0000",
    "end_1": "2020-05-12 16:39:56.562 +0000",
    "start_2": "2020-05-12 16:39:51.455 +0000",
    "end_2": "2020-05-12 16:39:56.562 +0000",
    "reuse": false,
    "legend": "<p>hasher2</p><p>Container Request: <a href=\"https://workbench.su92l.arvadosapi.com/container_requests/su92l-xvhdp-l8je8tws556fqcp\">su92l-xvhdp-l8je8tws556fqcp</a></p><p>Container duration: 5.10645s\n</p>" 
  },
  {
    "checkpoint": "hasher3",
    "start_1": "2020-05-12 16:39:57.608 +0000",
    "end_1": "2020-05-12 16:42:17.628 +0000",
    "start_2": "2020-05-12 16:42:14.836 +0000",
    "end_2": "2020-05-12 16:42:17.628 +0000",
    "reuse": false,
    "legend": "<p>hasher3</p><p>Container Request: <a href=\"https://workbench.su92l.arvadosapi.com/container_requests/su92l-xvhdp-jx5vk6lq26dsbba\">su92l-xvhdp-jx5vk6lq26dsbba</a></p><p>Container duration: 2.792018s\n</p>" 
  }
]

checkpoints

{
  "checkpoints": [
    {
      "name": "hasher2",
      "family": "abde1234-9876543",
      "dependencies": [
        "hasher1" 
      ],
      "time_average": 5.10645,
      "time_count": 1,
      "time_min": 5.10645,
      "time_min_comment": "duration:hasher2#su92l-dz642-eouma4xv1qpnhvc",
      "time_max": 5.10645,
      "time_max_comment": "duration:hasher2#su92l-dz642-eouma4xv1qpnhvc" 
    },
    {
      "name": "hasher3",
      "family": "87654321-fedcba01",
      "dependencies": [
        "hasher2" 
      ],
      "time_average": 2.792018,
      "time_count": 1,
      "time_min": 2.792018,
      "time_min_comment": "duration:hasher3#su92l-dz642-tn9t07438jd1zrt",
      "time_max": 2.792018,
      "time_max_comment": "duration:hasher3#su92l-dz642-tn9t07438jd1zrt" 
    },
    {
      "name": "hasher1",
      "family": "deadbeef-deafbeef",
      "dependencies": [],
      "time_average": 2.70491,
      "time_count": 1,
      "time_min": 2.70491,
      "time_min_comment": "duration:hasher1#su92l-dz642-e6d8emz3ez54owu",
      "time_max": 2.70491,
      "time_max_comment": "duration:hasher1#su92l-dz642-e6d8emz3ez54owu" 
    }
  ]
}
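
To show how a consumer might use this output, the sketch below estimates a total running time by following the "dependencies" of each checkpoint and adding up time_average along the longest chain. This is only one possible reading of "forecast", not the method defined by this document. It ignores queue time and treats every checkpoint as a single step, and the function name is hypothetical.

```python
def forecast_seconds(checkpoints):
    """Longest dependency chain, weighted by each checkpoint's time_average."""
    by_name = {c["name"]: c for c in checkpoints}
    finish = {}  # checkpoint name -> estimated finish time in seconds

    def finish_time(name, visiting=()):
        if name in finish:
            return finish[name]
        if name in visiting:
            raise ValueError("dependency cycle at " + name)
        checkpoint = by_name[name]
        # A checkpoint starts once its slowest dependency has finished.
        start = max(
            (finish_time(dep, visiting + (name,))
             for dep in checkpoint["dependencies"]),
            default=0.0,
        )
        finish[name] = start + checkpoint["time_average"]
        return finish[name]

    return max((finish_time(name) for name in by_name), default=0.0)
```

For the three hasher checkpoints above the chain is hasher1, hasher2, hasher3, so the estimate is 2.70491 + 5.10645 + 2.792018, about 10.6 seconds.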