Project

General

Profile

Actions

Bug #18903

closed

[user activity script] does not always initialize user object and other fixes

Added by Ward Vandewege 3 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
03/24/2022
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Description

Observed on one of our clusters:

User activity on XXXXX between 2022-03-22 05:00 and 2022-03-23 05:00

Traceback (most recent call last):
  File "/usr/bin/arv-user-activity", line 8, in <module>
    sys.exit(main())
  File "/usr/share/python3/dist/python3-arvados-user-activity/lib/python3.7/site-packages/arvados_user_activity/main.py", line 148, in main
    users[e["object_uuid"]].append("%s Downloaded file \"%s\" from \"%s\" (%s) (%s)" % (event_at,
KeyError: 'XXXXX-tpzed-redactedredacte'

Subtasks 1 (0 open1 closed)

Task #18904: review 18903-fix-activity-scriptResolvedWard Vandewege03/24/2022

Actions

Related issues

Related to Arvados - Bug #18943: [db] add missing created_at,uuid indexesResolvedPeter Amstutz04/01/2022

Actions
Actions #1

Updated by Ward Vandewege 3 months ago

  • Status changed from New to In Progress
Actions #2

Updated by Ward Vandewege 3 months ago

  • Description updated (diff)
Actions #3

Updated by Ward Vandewege 3 months ago

I noticed another issue. Sometimes we are logging the PDH in the collection_uuid property field:

|    id    |            uuid             |         owner_uuid          |   modified_by_client_uuid   |    modified_by_user_uuid    |         object_uuid         |          event_at          |  event_type   | summary |                                                                                                            properties                                                                                                             |         created_at         |         updated_at         |        modified_at         | object_owner_uuid 
  ---------+-----------------------------+-----------------------------+-----------------------------+-----------------------------+-----------------------------+----------------------------+---------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+----------------------------+----------------------------+-------------------
|  8553608 | ce8i5-57u5n-6nds5tf88tcpon4 | ce8i5-tpzed-xo2k4i24bjzwedw | ce8i5-ozdt8-g1uv539034n3e95 | ce8i5-tpzed-xo2k4i24bjzwedw | ce8i5-tpzed-xo2k4i24bjzwedw | 2022-03-23 15:32:27.445892 | file_download |         | {"collection_file_path":"banana.txt","collection_uuid":"7c25715f3a62344a85f69f0298f37cf3+52","portable_data_hash":"7c25715f3a62344a85f69f0298f37cf3+52","reqPath":"/_/banana.txt"}                                                | 2022-03-23 15:32:27.446505 | 2022-03-23 15:32:27.445892 | 2022-03-23 15:32:27.445892 | 

This happens when keep-web is accessed via the "PDH" url, which is what workbench1 does by default, e.g.:

https://7c25715f3a62344a85f69f0298f37cf3-52.collections.ce8i5.arvadosapi.com/_/banana.txt?disposition=inline

Workbench2 does it differently, it links to the "UUID" url:

https://ce8i5-4zz18-uegwkpr7zkv9u64.collections.ce8i5.arvadosapi.com/banana.txt

And that results in the a log entry with a properly filled out UUID field:

|    id    |            uuid             |         owner_uuid          |   modified_by_client_uuid   |    modified_by_user_uuid    |         object_uuid         |          event_at          |  event_type   | summary |                                                                                                            properties                                                                                                             |         created_at         |         updated_at         |        modified_at         | object_owner_uuid 
  ---------+-----------------------------+-----------------------------+-----------------------------+-----------------------------+-----------------------------+----------------------------+---------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+----------------------------+----------------------------+-------------------
|  8553959 | ce8i5-57u5n-vwovsgcunqx64wy | ce8i5-tpzed-xo2k4i24bjzwedw | ce8i5-ozdt8-g1uv539034n3e95 | ce8i5-tpzed-xo2k4i24bjzwedw | ce8i5-tpzed-xo2k4i24bjzwedw | 2022-03-23 15:47:49.681478 | file_download |         | {"collection_file_path":"banana.txt","collection_uuid":"ce8i5-4zz18-uegwkpr7zkv9u64","portable_data_hash":"7c25715f3a62344a85f69f0298f37cf3+52","reqPath":"/_/banana.txt"}                                                        | 2022-03-23 15:47:49.68217  | 2022-03-23 15:47:49.681478 | 2022-03-23 15:47:49.681478 |

The reason the UUID field is populated with the PDH is because keep-web uses the magic '.arvados#collection' file to get the collection object, and when that magic file is accessed via the PDH, it looks like this:

{
    "created_at": null,
    "current_version_uuid": "",
    "delete_at": null,
    "description": null,
    "etag": "",
    "file_count": 0,
    "file_size_total": 0,
    "is_trashed": false,
    "kind": "arvados#collection",
    "manifest_text": ". df3e129a722a865cc3539b4e69507bad+7+A1e68843243b1f0805125ed67bd0ed79f2ad72189@624dd322 0:7:banana.txt\n",
    "modified_at": null,
    "modified_by_client_uuid": null,
    "modified_by_user_uuid": "",
    "name": "",
    "owner_uuid": "",
    "portable_data_hash": "7c25715f3a62344a85f69f0298f37cf3+52",
    "preserve_version": false,
    "properties": null,
    "replication_confirmed": null,
    "replication_confirmed_at": null,
    "replication_desired": null,
    "storage_classes_confirmed": null,
    "storage_classes_confirmed_at": null,
    "storage_classes_desired": null,
    "trash_at": null,
    "unsigned_manifest_text": "",
    "uuid": "7c25715f3a62344a85f69f0298f37cf3+52",
    "version": 0
}

When accessed via the collection uuid, the uuid field is properly populated:

{
    "created_at": "2022-01-17T19:27:50.171775000Z",
    "current_version_uuid": "ce8i5-4zz18-uegwkpr7zkv9u64",
    "delete_at": null,
    "description": null,
    "etag": "asbyblglxbvg122yfe278zj1g",
    "file_count": 1,
    "file_size_total": 7,
    "is_trashed": false,
    "kind": "arvados#collection",
    "manifest_text": ". df3e129a722a865cc3539b4e69507bad+7+Aaa290e16e1b4492999fa1c931807f93fbfa63010@624dd3c2 0:7:banana.txt\n",
    "modified_at": "2022-03-16T15:25:05.584371000Z",
    "modified_by_client_uuid": "ce8i5-ozdt8-scctad96wtyq8sq",
    "modified_by_user_uuid": "ce8i5-tpzed-xo2k4i24bjzwedw",
    "name": "Banana collection",
    "owner_uuid": "ce8i5-j7d0g-u2ojl2izpjr1qae",
    "portable_data_hash": "7c25715f3a62344a85f69f0298f37cf3+52",
    "preserve_version": true,
    "properties": {
        "automation": "upload",
        "responsible_person_uuid": "ce8i5-tpzed-6oia07y9o4zvtl6" 
    },
    "replication_confirmed": 2,
    "replication_confirmed_at": "2022-01-17T23:03:16.767739000Z",
    "replication_desired": null,
    "storage_classes_confirmed": [
        "default" 
    ],
    "storage_classes_confirmed_at": "2022-01-17T23:03:16.767739000Z",
    "storage_classes_desired": [
        "default" 
    ],
    "trash_at": null,
    "unsigned_manifest_text": "",
    "uuid": "ce8i5-4zz18-uegwkpr7zkv9u64",
    "version": 4
}

From the perspective of the consumer of these logs, it would be better if the collection_UUID field was left blank if it can't be determined with certainty.

Though, arguably, there is a (probably quite common) case where a PDH only matches one single collection, where keep-web could fill in the collection_UUID field properly. That seems like it should be easy.

Actions #4

Updated by Ward Vandewege 3 months ago

A few fixes for the user activity script:
  • always initialize user object when printing file download/upload activity
  • work around the sometimes wonky collection_UUID field for file upload/download activity
  • formatting fix, a few output lines were not prepending a timestamp
  • handle deleted users
  • add a date range option
  • retrieve the logs a day at a time to handle huge log tables more efficiently (see e.g. https://dev.arvados.org/issues/18939#note-3, and that is after this optimization)
Simple fix for keep-web:
  • blank the collection_uuid field if it is the same as the portable_data_hash field (i.e. when the collection uuid is not immediately known)

Wrt paging day by day:

This is how expensive a query for 1 month got on one of our clusters:

arvados=# explain (analyze,buffers) SELECT  count("logs".id) FROM "logs" WHERE ((logs.created_at >= '2021-09-01 00:00:00') AND (logs.created_at < '2021-10-01 00:00:00'));
                                                                                   QUERY PLAN                                                                                    
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=3683442.69..3683442.70 rows=1 width=8) (actual time=141418.543..141423.372 rows=1 loops=1)
   Buffers: shared hit=4688152 read=2577826
   I/O Timings: read=413814.160
   ->  Gather  (cost=3683442.48..3683442.69 rows=2 width=8) (actual time=141400.307..141423.362 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=4688152 read=2577826
         I/O Timings: read=413814.160
         ->  Partial Aggregate  (cost=3682442.48..3682442.49 rows=1 width=8) (actual time=141404.917..141404.918 rows=1 loops=3)
               Buffers: shared hit=4688152 read=2577826
               I/O Timings: read=413814.160
               ->  Parallel Index Scan using index_logs_on_created_at on logs  (cost=0.44..3670361.86 rows=4832246 width=4) (actual time=0.227..141012.563 rows=3899079 loops=3)
                     Index Cond: ((created_at >= '2021-09-01 00:00:00'::timestamp without time zone) AND (created_at < '2021-10-01 00:00:00'::timestamp without time zone))
                     Buffers: shared hit=4688152 read=2577826
                     I/O Timings: read=413814.160
 Planning Time: 0.124 ms
 Execution Time: 141423.412 ms
(17 rows)

That's after doing `vacuum analyze` on the logs table. You can see it spends almost all its time reading the full rows from the table for the entire date range. I don't really understand why a count(id) query apparently ends up reading the entire row for each matching entry, but that is moot; the user activity script actually reads (most of) the fields in the row anyway.

In the user activity script we do a query like the one above, but with an `order by created_at` added. Because we use `arvados.util.keyset_list_all`, that is translated in many queries. For every page, keyset_list_all does a query like this:

2022-03-28 22:00:09.644 UTC [18447] arvados@arvados LOG:  duration: 123275.497 ms  execute <unnamed>: SELECT  "logs".* FROM "logs" WHERE ((logs.created_at >= '2021-09-01 19:56:07.975436') AND (logs.uuid <> '2xpu4-57u5n-4o204806665sbmg' OR logs.uuid IS NULL) AND (logs.created_at >= '2021-09-01 00:00:00') AND (logs.created_at < '2021-10-01 00:00:00')) ORDER BY logs.created_at asc, logs.uuid asc LIMIT $1 OFFSET $2
2022-03-28 22:00:09.644 UTC [18447] arvados@arvados DETAIL:  parameters: $1 = '1000', $2 = '0'

There are 11,697,236 matching entries for the entire month of September 2021, so that's ~1170 pages of 1000 results, so ~1170 * 120s makes for .... a very very long time.

When the range is shorter (e.g. 1 day), the number of rows that match is much smaller, which means each query is much faster. Still a total of ~1100 pages to get (a few more, actually, since some will be truncated at the day boundaries), but each page is much faster. That allowed the report for that period to finish in 150 minutes. Compare a "cheap" day (1,865 rows):

arvados=# explain (analyze,buffers) SELECT  count("logs".id) FROM "logs" WHERE ((logs.created_at >= '2021-09-01 00:00:00') AND (logs.created_at < '2021-09-02 00:00:00'));
                                                                           QUERY PLAN                                                                           
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=653.95..653.96 rows=1 width=8) (actual time=1.119..1.120 rows=1 loops=1)
   Buffers: shared hit=519
   ->  Index Scan using index_logs_on_created_at on logs  (cost=0.44..649.07 rows=1955 width=4) (actual time=0.024..0.881 rows=1865 loops=1)
         Index Cond: ((created_at >= '2021-09-01 00:00:00'::timestamp without time zone) AND (created_at < '2021-09-02 00:00:00'::timestamp without time zone))
         Buffers: shared hit=519
 Planning Time: 0.130 ms
 Execution Time: 1.148 ms
(7 rows)

And the most expensive single day (1,143,176 rows):

arvados=# explain (analyze,buffers) SELECT  count("logs".id) FROM "logs" WHERE ((logs.created_at >= '2021-09-04 00:00:00') AND (logs.created_at < '2021-09-05 00:00:00'));
                                                                                 QUERY PLAN                                                                                  
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=327747.35..327747.36 rows=1 width=8) (actual time=13620.712..13622.783 rows=1 loops=1)
   Buffers: shared hit=605744 read=242862
   I/O Timings: read=39865.665
   ->  Gather  (cost=327747.14..327747.35 rows=2 width=8) (actual time=13620.705..13622.779 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=605744 read=242862
         I/O Timings: read=39865.665
         ->  Partial Aggregate  (cost=326747.14..326747.15 rows=1 width=8) (actual time=13616.950..13616.951 rows=1 loops=3)
               Buffers: shared hit=605744 read=242862
               I/O Timings: read=39865.665
               ->  Parallel Index Scan using index_logs_on_created_at on logs  (cost=0.44..325680.91 rows=426490 width=4) (actual time=1.250..13582.559 rows=381059 loops=3)
                     Index Cond: ((created_at >= '2021-09-04 00:00:00'::timestamp without time zone) AND (created_at < '2021-09-05 00:00:00'::timestamp without time zone))
                     Buffers: shared hit=605744 read=242862
                     I/O Timings: read=39865.665
 Planning Time: 0.129 ms
 Execution Time: 13622.818 ms
(17 rows)

ready for review in 1825273afecc039a21c1da575a5f156e089b0478 on branch 18903-fix-activity-script

developer-run-tests: #2995
developer-run-tests-apps-workbench-integration: #3198

Actions #5

Updated by Ward Vandewege 3 months ago

  • Release set to 46
Actions #6

Updated by Peter Amstutz 3 months ago

I'm still confused about the need for day-by-day paging logic.

The way keyset_list_all() is supposed to work, it gets one page of results (up to "limit"), and updates the filter. The filter is chosen to take advantage of the b-tree index.

It should never compute count().

If it's using the index correctly, it shouldn't need to sort anything.

The limit should allow it to bail out as soon as it hits the limit number of rows.

If it's using the index correctly, it should only ever read at most "limit" entries.

Actions #7

Updated by Peter Amstutz 3 months ago

From discussion on gitter:

The underlying problems seems to be that the index was on (created_at) but the order by was on (created_at, uuid) so it would load all the rows and then sort them.

It sounds like if we fix the index to be sorted on (created_at, uuid) then the original query is efficient, and we don't need the additional complexity of day-by-day paging?

Actions #8

Updated by Ward Vandewege 3 months ago

  • Subject changed from [user activity script] does not always initialize user object to [user activity script] does not always initialize user object and other fixes
Actions #9

Updated by Ward Vandewege 3 months ago

With the `created_at,uuid` index added to the logs table (cf #18943):

arv-user-activity with per-day paging:

root@2xpu4:~# time /usr/bin/arv-user-activity --start 2021-09-01 --end 2021-10-01 > the.report
real    85m6.562s
user    5m47.772s
sys     0m9.823s

arv-user-activity without per-day paging:

root@2xpu4:~# time /usr/bin/arv-user-activity --start 2021-09-01 --end 2021-10-01 > the.report

real    85m45.825s
user    5m51.489s
sys     0m10.372s

That's close enough that we don't need the per-day paging I think.

ready for review at c3c1f4261f61bc52dbc8fadad644520797b3f6a6 on branch 18903-fix-activity-script

developer-run-tests: #2994
developer-run-tests-apps-workbench-integration: #3205

Actions #10

Updated by Ward Vandewege 3 months ago

  • Related to Bug #18943: [db] add missing created_at,uuid indexes added
Actions #11

Updated by Peter Amstutz 3 months ago

  • Target version changed from 2022-03-30 Sprint to 2022-04-13 Sprint
Actions #12

Updated by Peter Amstutz 3 months ago

Ward Vandewege wrote:

ready for review at c3c1f4261f61bc52dbc8fadad644520797b3f6a6 on branch 18903-fix-activity-script

developer-run-tests: #2994
developer-run-tests-apps-workbench-integration: #3205

Couple small thoughts.

  • The "--end" parameter could default to "now" (then "--start" on its own is just another way of saying "--days")
  • The help text should say if --start and --end use local midnight or UTC midnight
  • getCollectionName could be explicit in reporting the oldest or youngest copy of the collection (specify order=created_at or modified_at) and limit=1 since it only expects 1 or 0 results.

rest LGTM

Actions #13

Updated by Ward Vandewege 3 months ago

Peter Amstutz wrote:

Ward Vandewege wrote:

ready for review at c3c1f4261f61bc52dbc8fadad644520797b3f6a6 on branch 18903-fix-activity-script

developer-run-tests: #2994
developer-run-tests-apps-workbench-integration: #3205

Couple small thoughts.

  • The "--end" parameter could default to "now" (then "--start" on its own is just another way of saying "--days")
  • The help text should say if --start and --end use local midnight or UTC midnight
  • getCollectionName could be explicit in reporting the oldest or youngest copy of the collection (specify order=created_at or modified_at) and limit=1 since it only expects 1 or 0 results.

rest LGTM

Thanks, merged with the last two things fixed (and a few more tweaks). I didn't introduce a default for '--end'.

Actions #14

Updated by Ward Vandewege 3 months ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF