Story #8876

[Workbench] Unify job and pipeline instance show tabs using Work Unit interface

Added by Brett Smith over 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Radhika Chippada
Category:
-
Target version:
Start date:
05/25/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
2.0

Description

Baseline: There is a single set of views (including partials) that renders the Status tab for both Jobs and Pipeline Instances. The UI is as close as possible to the existing Pipeline Instance status tab. It is implemented using the new Work Unit interface, so the same code can be used to render both Jobs and Pipeline Instances, without conditional branches based on the type of object being rendered.

Follow the structure in #8651-3 to implement work unit proxies for job and pipeline instance.

These methods need to be implemented in the Work Unit interface, and will generally have different implementations for jobs and pipeline instances:

  • modified_by_user_uuid
  • created_at
  • started_at
  • finished_at
  • state_label (a string to show to the user, like "Queued" or "RunningOnServer" -- this method does not try to translate states to a lowest-common-denominator set of states, or anything like that)
  • state_bootstrap_class (a class like "danger", "success", or "warning" that a view can use directly to make a display class like <span class="badge badge-success" ...>
  • success? -- true if the work unit finished successfully, false if it has a permanent failure, and nil if the final state is not determined.
  • progress (a number between 0 and 1)
  • log_collection -- uuid or pdh with saved log data, if any. (Log rendering is explicitly not part of this story.)
  • label -- for either jobs or pipeline instances, when they're the top-level thing you're looking at, the label is the name. Otherwise, it comes from the object's components field. This is assigned when the work unit is created (see #8647).
  • components -- an array of work unit objects

Both work units also need the following methods, which are expected to return nil for pipeline instances because they're not relevant:

  • parameters
  • script
  • script_repository
  • script_version
  • supplied_script_version
  • docker_image
  • runtime_constraints
  • priority
  • nondeterministic
  • output

Still needs discussing: control. Canceling jobs, canceling pipeline instances, pausing and unpausing pipeline instances.

  • can_cancel? -- should we offer the user a "cancel" button?
  • cancel
  • can_pause? -- should we offer the user a "pause" button?
  • pause

Suggest two partial views: "this is the thing being displayed," "this is a child being displayed." Job and pipeline instance views just call those partials to do their work.

Jobs show controller will get the job, get its work unit, then render the work unit view for this job's component. Same goes for pipeline instance controller.


Subtasks

Task #9207: Review branch 8876-work-unitResolvedTom Clegg


Related issues

Related to Arvados - Story #9312: [Workbench] Job children display using the new work_unit strategyNew

Associated revisions

Revision d10c7919
Added by Radhika Chippada about 3 years ago

closes #8876
Merge branch '8876-work-unit'

Revision 8630c547
Added by Radhika Chippada about 3 years ago

refs #8876
Merge branch '8876-work-unit'

History

#1 Updated by Brett Smith over 3 years ago

  • Subject changed from [Workbenchc] Unify job and pipeline instance show tabs using Work Unit interface to [Workbench] Unify job and pipeline instance show tabs using Work Unit interface
  • Description updated (diff)

#2 Updated by Brett Smith over 3 years ago

  • Description updated (diff)

#3 Updated by Brett Smith over 3 years ago

  • Description updated (diff)

#4 Updated by Brett Smith over 3 years ago

  • Description updated (diff)

#5 Updated by Brett Smith over 3 years ago

  • Description updated (diff)

#6 Updated by Brett Smith over 3 years ago

  • Target version set to Arvados Future Sprints

#7 Updated by Tom Clegg over 3 years ago

  • Description updated (diff)

#8 Updated by Tom Clegg over 3 years ago

  • Description updated (diff)

#9 Updated by Brett Smith over 3 years ago

  • Target version changed from Arvados Future Sprints to 2016-05-25 sprint

#10 Updated by Brett Smith over 3 years ago

  • Description updated (diff)

#11 Updated by Brett Smith over 3 years ago

  • Target version changed from 2016-05-25 sprint to Arvados Future Sprints

#12 Updated by Brett Smith over 3 years ago

  • Target version changed from Arvados Future Sprints to 2016-05-25 sprint

#13 Updated by Radhika Chippada over 3 years ago

  • Assigned To set to Radhika Chippada
  • Story points set to 2.0

#14 Updated by Radhika Chippada over 3 years ago

  • Status changed from New to In Progress

#15 Updated by Radhika Chippada over 3 years ago

#8651 says the following. However, this is not quite comprehensive. Assume a pipeline instance with 10 components, 1 finished and 1 still running and 8 not yet scheduled. In this case, the instance will have only 2 JobWorkUnit objects corresponding to it. Hence, the children() implementation needs to include, potentially, ProxyWorkUnit objects corresponding to those remaining 8 components.

class JobWorkUnit < ProxyWorkUnit
  def children
    n = -1
    self.proxied.job_tasks.map do |t|
      n++
      t.work_unit("task #{n}")
    end
  end
end

#16 Updated by Radhika Chippada over 3 years ago

Please do a sanity check on work unit implementation in branch 8876-work-unit at a4f061a5

Some main points:

Work unit models

  • work_unit models define exposed properties such as children etc.
  • I added those properties listed in the description and also some more properties such as “title” so that the new UI is as close to the previous as possible
  • I saw issues reading properties from the object versus object hash representation. So I used something like below to get properties. I can modify if there is a better alternative.
      def output
        (self.proxied.output if self.proxied.respond_to?(:output)) || self.proxied[:output]
      end
    

Work unit views

  • The views (pipeline and job) are very close to what they currently are.
  • To enable child display, I expanded the pipeline display to be more detailed like a job display. Thus the “main object” display includes uuid, created_at, script etc (similar to job display) and then a panel for each child. Within each child display panel, again the detailed section and a panel for each child is included.
  • In the job detail display, I switched the two columns. The first column used to be script, script version etc and the second column used to be uuid, started_at etc. I switched this order so that when a pipeline is being displayed the first display column is populated.
  • Also, to be able to not display properties that are not applicable to a work unit (for ex: script when displaying a pipeline display), the detailed section only displays the label when a value exists. In _component_detail.htnl.erb:
                    <% unless val.nil? %>
                    <tr>
                      <td style="padding-right: 1em">
                        <%= k.to_s %>:
                      </td> …
    
  • I added a can_cancel? to pipeline_instance_work_unit so that the current job cancelation option is still available.

WIP

  • I have not yet combed through the entire code to make sure code is not pipeline or job centric in certain areas and no choking on nils etc
  • The “progress bar” is still not in place for a child panel (jobs). It is currently just displaying a number.
  • A few tests are still failing
  • Need to add a pipeline fixture with jobs where each of these jobs has child jobs and add a test to verify display

#17 Updated by Peter Amstutz over 3 years ago

So overall I think this approach works, thanks for that, it just needs polish. Things I noticed:

Jobs and tasks are reporting their run time as 0m0s.

It is reporting scaling as "1/1⨯".

Timestamps are not being localized. They should be rendered like this:

<span class="utc-date" data-utc-date="2016-05-25 11:45:11 UTC" data-utc-date-opts="noseconds"></span>

It says "This work unit..." instead of "This job task..." for job tasks.

It should show task parameters for job tasks.

Job tasks should have report a state (queued/running/complete) based on whether started_at and finished_at are filled in.

It says "This job is queued." but then it says "It ran for 0m0s"

Need to support jobs with "components"

#18 Updated by Brett Smith over 3 years ago

  • Target version changed from 2016-05-25 sprint to 2016-06-08 sprint

#19 Updated by Radhika Chippada about 3 years ago

At bbbf6099

  • Addressed several bugs and loose ends. Mostly seems to work now.
  • Jobs now use the same display as pipelines, with a description section followed by a panel with children. Due to this, the job display no longer has the top panel header where the progress bar used to be displayed. I hence added the Cancel button and progress bar in the description section (for the top most object). The progress bar displays "Active" when there is no progress yet, and then replaces it with the progress bar when the progress is > 0.0. This section probably can use a bit tweaking to make it look a bit more appealing to the eye.
  • Displaying of a job's children: Tom raised a potential performance related concern that a job may have 100s of job_tasks and displaying them all might be undesirable. For the time being, the job display only includes children as found in "components" and omits any job_tasks. I created #9312 to discuss the performance concern when a lot of children are present under a job.
  • Added an integration test where a job with components is viewed and the children and their children are verified
  • One test is failing (the other tests are passing). pipeline_instances_controller_test -> "dates in JSON components are parsed" . I can fix this using similar logic as in proxy_work_unit -> started_at, but first I am trying to understand why this is failing in the first place and if there is any underlying problem. Please let me know if you think of a reason why this may be failing.

Thanks.

#20 Updated by Radhika Chippada about 3 years ago

Discussed job children display related performance concern with Tom and we agreed on the following in the scope of this story:

  • Continue to display all children of a pipeline instance as before
  • Job
    • Consider only "components" as the children of a job while displaying the child work units. Do not include any job_tasks in this context
    • While computing "progress" of a job, use components' progress when components are present. If no components exist, use progress of the job_tasks.
  • If a work unit has unreadable children, make sure the display is appropriate in this situation.

#21 Updated by Radhika Chippada about 3 years ago

a0187bc7 Improved the logic around children and progress as described in note 20.

#22 Updated by Tom Clegg about 3 years ago

Bigger questions:

With all this new code, I hoped we would be able to delete a lot of old code. Is all of the old pipeline instance / job view code still being used?

It looks like there's still an awful lot of logic in the new views, which is what we're trying to get away from. For example, "if wu.state_label == 'Complete'" seems wrong: it's being offered as a label to show the user, but the view is also using it to reason about which other things to display.

It's OK if we don't fix everything in the first merge, but it would be nice if we could assure ourselves that the work unit structure is going to help us when we do fix everything. For example, where does the "uuid of repository, if any" logic really belong? It seems to be specific to Job, so presumably it should be in JobWorkUnit?

Smaller questions:

"row-fulid" → row-fluid

What text do we expect here?

       if expect_options
         assert_text 'supplied_script_version: master'
       else
-        assert_text 'supplied_script_version: (none)'
+        assert_no_text 'supplied_script_version: (none)'
       end

Suggest using each_with_index do |c, idx| here.

+  <% wu.children.each do |c| %>
+    <% @descendent_count += 1 %>

I'm trying it out and it looks like child jobs 2..N of 4xphq-d1hrv-f9lvha7avhz582d aren't being displayed properly but I haven't figured out why yet.

#23 Updated by Radhika Chippada about 3 years ago

Tom said:

With all this new code, I hoped we would be able to delete a lot of old code. Is all of the old pipeline instance / job view code still being used?

For now my main focus is to get the UIs working with work_unit interface. I think pipeline_instance_running.html.erb is no longer needed, but I am postponing any deletions until after the review.

It looks like there's still an awful lot of logic in the new views, which is what we're trying to get away from. For example, "if wu.state_label == 'Complete'" seems wrong: it's being offered as a label to show the user, but the view is also using it to reason about which other things to display.

How about I add another wu.state method and use it instead? I also considered this versus using state_label.

It's OK if we don't fix everything in the first merge, but it would be nice if we could assure ourselves that the work unit structure is going to help us when we do fix everything. For example, where does the "uuid of repository, if any" logic really belong? It seems to be specific to Job, so presumably it should be in JobWorkUnit?

I went back and forth on this. I initially had these in the JobWorkUnit and moved into the ProxyWorkUnit. We have "priority" both in jobs and containers for example. The get() I am using results in returning nil when it is not supported by a specific work unit and hence is not a big issue imho. If you prefer, I am happy to move them into the JobWorkUnit.

What text do we expect here? assert_no_text 'supplied_script_version: (none)'

As I said in note 16, if a work unit does not support a particular property, the UI does not display "xxx: (none)". Instead, it omits displaying it. This is to help ensure the UI between Jobs and PipelineInstances, and potentially other objects in future, does not look like a patchwork of unrelated things.

I'm trying it out and it looks like child jobs 2..N of 4xphq-d1hrv-f9lvha7avhz582d aren't being displayed properly but I haven't figured out why yet.

You probably are referring to missing uuid etc column on the left for child display. As I said above, I am no longer display the empty "xxx: (none)" if a value is nil. Howerver, I think I will go back and do this suppression selectively and always display uuid, modified_by_uuid, started_at, finished_at fields since they are applicable to all three types of objects we are currently considering.

#24 Updated by Tom Clegg about 3 years ago

Radhika Chippada wrote:

For now my main focus is to get the UIs working with work_unit interface. I think pipeline_instance_running.html.erb is no longer needed, but I am postponing any deletions until after the review.

Sounds good to me.

It looks like there's still an awful lot of logic in the new views, which is what we're trying to get away from. For example, "if wu.state_label == 'Complete'" seems wrong: it's being offered as a label to show the user, but the view is also using it to reason about which other things to display.

How about I add another wu.state method and use it instead? I also considered this versus using state_label.

I don't think that really addresses the point that the view shouldn't even have this logic.

For example, in the part where we decide whether to show a log, and what to say if we can't find it, currently the view says "if state is this then that else otherthing". It would be better if the view could say "<%= wu.link_to_log %>". All that stuff about choosing live/saved log depending on the state of the w.u., and deciding what to say when a log is unavailable, could be in a work unit model class.

It's OK if we don't fix everything in the first merge, but it would be nice if we could assure ourselves that the work unit structure is going to help us when we do fix everything. For example, where does the "uuid of repository, if any" logic really belong? It seems to be specific to Job, so presumably it should be in JobWorkUnit?

I went back and forth on this. I initially had these in the JobWorkUnit and moved into the ProxyWorkUnit. We have "priority" both in jobs and containers for example. The get() I am using results in returning nil when it is not supported by a specific work unit and hence is not a big issue imho. If you prefer, I am happy to move them into the JobWorkUnit.

I'm thinking of this in terms of testing. If the logic is in the work unit classes, we can write very efficient unit tests like "a pipeline component will have a link to a repo" and "a pipeline instance will not have a link to a repo" just by calling a work unit method and checking that it returns a uuid or nil. If the logic is in the views, we can only test the logic by testing the views, and (aside from being slower to write and slower to run) those tests will tend to fail or become ineffective whenever we touch the HTML in our views -- so touching the HTML gets expensive.

What text do we expect here? assert_no_text 'supplied_script_version: (none)'

As I said in note 16, if a work unit does not support a particular property, the UI does not display "xxx: (none)". Instead, it omits displaying it. This is to help ensure the UI between Jobs and PipelineInstances, and potentially other objects in future, does not look like a patchwork of unrelated things.

OK. In that case the assertion should probably say

assert_no_text 'supplied_script_version'

I'm trying it out and it looks like child jobs 2..N of 4xphq-d1hrv-f9lvha7avhz582d aren't being displayed properly but I haven't figured out why yet.

You probably are referring to missing uuid etc column on the left for child display. As I said above, I am no longer display the empty "xxx: (none)" if a value is nil. Howerver, I think I will go back and do this suppression selectively and always display uuid, modified_by_uuid, started_at, finished_at fields since they are applicable to all three types of objects we are currently considering.

I think what I learned here is that it wasn't obvious to me that 2..N weren't actually jobs, but rather pipeline components that hadn't had jobs assigned to them.

I'm not sure "xxx: (none)" would be a step in the right direction. That seems to imply that there is something that could have a UUID, but doesn't.

How about, instead of showing a list of fields the non-existent job would have if it existed, we just say "No job has been submitted yet for this pipeline component."

Moving script_parameters over to the right might make a bit more sense -- then the stuff on the left would be "what happened when we ran the thing", and would be empty if nothing has been run/queued, while the stuff on the right would be "description of what is supposed to happen", which is meaningful whether or not anything has actually been run/queued. But I don't want to get hung up on UI bikesheds, so maybe we should just stick with adding a "No job has been submitted..." sentence.

#25 Updated by Radhika Chippada about 3 years ago

Addressed some of the comments in note 24.

where does the "uuid of repository, if any" logic really belong? It seems to be specific to Job, so presumably it should be in JobWorkUnit

Since we display pipeline components using a JobWorkUnit or a ProxyWorkUnit (when only a component is available and no job yet submitted), to display script, script_version etc, these need to be in the ProxyWorkUnit. So, as I explained above the get(:x) method return nil if an object type does not support a specific attribute and will return the correct value when applicable.

"f wu.state_label == 'Complete'" or " 'It would be better if the view could say "<%= wu.link_to_log %>"'

I introduced several view helper methods such as link_to_log, queuedtime etc so that the views do not have to do much of this determination. I think some of them such as "This pipeline has run for ..." seem to be an overkill moving them into the work_unit, but anyways. Hopefully it is not too messy.

Suggest using each_with_index do |c, idx| here. <% @descendent_count += 1 %>

We can have a scenario where a parent with 2 children with 2 children each. So, using index does not work. I used this to set unique panel ids for the child components. Anyways, minor detail.

#26 Updated by Tom Clegg about 3 years ago

Rather than having two methods can_cancel? and can_be_canceled?, could JobWorkUnit#can_cancel? just return state.in? ['Queued', 'Running']?

views/work_unit/_show_child has comments about "column offset" which I suspect aren't accurate. I think it will be easier to follow (both when reading the code and when seeing the rendered version) if the conditional display stuff goes inside the "div col-md-X" elements where possible:

div class=col-md-3
  if walltime and cputime
    render_runtime
  end
/div

This should help us avoid weird display bugs where certain combinations of decision flags cause the columns to get misaligned.

self.my_children should probably be @my_children

some of them such as "This pipeline has run for ..." seem to be an overkill

It looks like you only moved the word choices like "ran for" / "has run for" into the model, but how about the actual runtime calculation? Perhaps work unit methods like "explain_runtime" and "explain_state" could return entire sentences, like "This blah has been active for foobar. It has used foobar of node allocation time"?

parent with 2 children with 2 children each

Aha, that makes sense, I didn't see that.

Since we display pipeline components using a JobWorkUnit or a ProxyWorkUnit (when only a component is available and no job yet submitted), to display script, script_version etc, these need to be in the ProxyWorkUnit

Why do we use ProxyWorkUnit for "component with no job"? It sounds like maybe we should use JobWorkUnit? Or make a class PipelineComponentWorkUnit < JobWorkUnit if we need different behavior?

I think we should say "No job has been submitted yet" and "no new jobs will be submitted" (instead of "...process...") -- I know we're planning to start using the term "process" elsewhere, but in these cases we really are talking specifically about jobs. (Using JobWorkUnit instead of ProxyWorkUnit for unsubmitted jobs might address the first case automatically?)

#27 Updated by Radhika Chippada about 3 years ago

Rather than having two methods can_cancel? and can_be_canceled?, could JobWorkUnit#can_cancel? just return state.in? ['Queued', 'Running']?

Updated can_cancel?

views/work_unit/_show_child has comments about "column offset" which I suspect aren't accurate ...

Updated accordingly and removed those confusing comments

self.my_children should probably be @my_children

Updated all self.x references

It looks like you only moved the word choices like "ran for" / "has run for" into the model ...

Moved all the "show_runtime" logic into proxy. It is much easier to read now.

Why do we use ProxyWorkUnit for "component with no job"?

You are right. Updated to use JobWokrUnit for components. Since we have safe handling of unsupported / unavailable attributes in get(:x) method, this is not a problem.

I think we should say "No job has been submitted yet" and "no new jobs will be submitted" (instead of "...process...") ...

The above update resolved this as well

#28 Updated by Tom Clegg about 3 years ago

This is better, thanks.

Just have two small questions

The column spacing in _show_child still looks a little suspicious (we seem to use col-md-6, 3, or 4 depending on various conditions -- shouldn't those all be 3 in order to fill 12 cols in every row?).

Looking at 4xphq-d1hrv-lj64iwkxz7h6agh, the "hasher" job says "1m / 1m (1.0⨯)" in the header but "It ran for 0m 0s (1m queued)" in the details. Is queue time being included in the scale factor calculation?

With those fixed (or not, if they're already right and I'm just confused), LGTM

#29 Updated by Radhika Chippada about 3 years ago

The column spacing in _show_child still looks a little suspicious (we seem to use col-md-6, 3, or 4 depending on various conditions -- shouldn't those all be 3 in order to fill 12 cols in every row?).

Corrected this inconsistency which was present in the original code

Looking at 4xphq-d1hrv-lj64iwkxz7h6agh, the "hasher" job says "1m / 1m (1.0⨯)" in the header but "It ran for 0m 0s (1m queued)" in the details. Is queue time being included in the scale factor calculation?

Glad you noticed this. The cpu and running time computations were using children of an object. This was failing when there are no jobs. Corrected this to use the work unit itself when there are no children for it.

#30 Updated by Tom Clegg about 3 years ago

LGTM @ b7dbb80, thanks!

#31 Updated by Radhika Chippada about 3 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:d10c79192b333191796d949841ec792e61a6006c.

Also available in: Atom PDF