Project

General

Profile

Actions

Feature #12167

closed

[SDKs] [API] [keepstore] [keepproxy] Facilitate request tracing for all services

Added by Nico César over 6 years ago. Updated almost 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Story points:
1.0
Release:
Release relationship:
Auto

Description

Background

When an arv-put process is slow or broken, it is useful to grep logs for related server-side activity. To make this easy, arv-put should generate a "request ID" at startup, and all components should propagate this ID to log files and other Arvados components.

Specific improvements

  • A request ID looks like "req-" + 20 random alphanum chars, e.g., req-iNjNZUDqU35yxSXFYJ8x
  • (Done) If running in verbose mode, arv-put should log the request ID to stderr as soon as it's generated.
  • Whether running in verbose mode or not, arv-put should log the request ID in every error log. Presumably the easiest way to do this is to include it in every log message, just like timestamp and PID. Thus, a failed arv-put will always log its request ID, because every failure mode causes a log (make sure this is true even for SIGINT).
  • (Done) Each keepstore/keepproxy/API request made by arv-put should include a request header like X-Request-Id: req-iNjNZUDqU35yxSXFYJ8x (no ETag-esque quoting).
  • (Done) When keepproxy receives an X-Request-Id request header, it should propagate that same header when it makes keepstore requests in the course of servicing the incoming request instead of inventing its own. Keepproxy's request log should include whichever request ID was used. Keepproxy's generated IDs should be updated to the above format.
  • keepproxy should also send the X-Request-Id when making API requests.
  • (Done) When keepstore receives an X-Request-Id request header, it should include the header value in the corresponding request log; otherwise it should generate a random one.
  • (Done) When API server receives an X-Request-Id request header, it should include the header value in the corresponding request log; otherwise it should generate a random one.
  • (Done) Workbench should include its request ID in an X-Request-Id header instead of a "current_request_id" query parameter. (Current code tries to do this, but it doesn't work. Rails middleware is probably overwriting it.)
  • (Done) Workbench's generated IDs should also be updated to the above format.
  • (Done) When Workbench receives an X-Request-Id request header, it should use that header value instead of generating a new request ID.
  • keep-web, too.

Of course, incoming headers can be broken or malicious, so logs should use appropriate quoting.


Subtasks 6 (0 open6 closed)

Task #12493: Review 12167-workbench-request-idResolvedTom Clegg08/24/2017Actions
Task #12544: Tracing for keepstore, keepproxy, Go & Python SDKsResolvedTom Clegg08/24/2017Actions
Task #12579: Review 12167-keep-request-idResolvedTom Clegg08/24/2017Actions
Task #12777: Review 12167-python-request-idResolvedTom Clegg08/24/2017Actions
Task #13227: Review 12167-arvput-log-on-signalsResolvedTom Clegg08/24/2017Actions
Task #13396: Review 12167-go-request-idResolvedTom Clegg08/24/2017Actions

Related issues

Related to Arvados - Idea #12512: [Workbench] Implement X-Request-ID to enable request tracingResolvedTom Clegg10/30/2017Actions
Related to Arvados - Idea #13147: [arv-put] Add request ID to requests for tracabilityResolvedLucas Di Pentima03/06/2018Actions
Actions #1

Updated by Tom Clegg over 6 years ago

  • Subject changed from [keep] [keepproxy] [API] [arv-put] make a "current_request_id" in arv-put just like in workbench to [SDKs] [API] [keepstore] [keepproxy] Propagate and log request IDs to permit tracing of arv-put processes
  • Description updated (diff)
Actions #2

Updated by Tom Clegg over 6 years ago

  • Subject changed from [SDKs] [API] [keepstore] [keepproxy] Propagate and log request IDs to permit tracing of arv-put processes to [SDKs] [API] [keepstore] [keepproxy] Facilitate request tracing for arv-put
  • Description updated (diff)
Actions #3

Updated by Tom Clegg over 6 years ago

  • Tracker changed from Bug to Feature
  • Project changed from 35 to Arvados
Actions #4

Updated by Tom Morris over 6 years ago

  • Target version set to Arvados Future Sprints
Actions #5

Updated by Tom Clegg over 6 years ago

This should be done in such a way that all clients can benefit from it, although the one we're most interested in is arv-put. Presumably, this means the SDK generates one request ID per process and passes it in all API calls; and arv-put gets it from the SDK in order to log it at all the right times.

Actions #6

Updated by Tom Morris over 6 years ago

  • Story points set to 2.0
Actions #7

Updated by Tom Clegg over 6 years ago

  • Description updated (diff)
Actions #8

Updated by Tom Morris over 6 years ago

I'm not sure I see the value of having the backends (Keep & API servers) generate random request IDs rather than just using a fixed string like "None" or leaving it blank. If the logging is noticeably different we can track where we need to add additional request IDs on the client side.

Actions #9

Updated by Tom Morris over 6 years ago

  • Target version changed from Arvados Future Sprints to 2017-11-08 Sprint
Actions #10

Updated by Tom Morris over 6 years ago

  • Assigned To set to Tom Clegg
Actions #11

Updated by Tom Clegg over 6 years ago

  • Status changed from New to In Progress
Actions #12

Updated by Tom Clegg over 6 years ago

Workbench and API server portions:

12167-workbench-request-id @ 5f46fe83d8a15048c4a5b90553c379803e99b135

Actions #13

Updated by Lucas Di Pentima over 6 years ago

The updates at branch 12167-workbench-request-id LGTM.
If the req-id is created from the API when not provided, would it be useful to add another prefix to identify which component created the id? For example: req-api-xxxxx, req-wb-xxxx, and arv-put would be able to use it's own req-put-xxxx. I believe this would allow us do what TomM said on note-8, identify whenever there're clients not creating their own ids.
Also, it would allow us to extract some statistics to compare clients' "chattiness".

Actions #14

Updated by Tom Clegg over 6 years ago

  • Description updated (diff)
Actions #15

Updated by Tom Clegg over 6 years ago

  • Target version changed from 2017-11-08 Sprint to 2017-11-22 Sprint
Actions #16

Updated by Tom Clegg over 6 years ago

  • Target version changed from 2017-11-22 Sprint to 2017-12-06 Sprint
Actions #17

Updated by Tom Clegg over 6 years ago

12167-keep-request-id @ b21dbd5c2fec1c8c61d232b2cf7f0a68240d9203

todo: merge #12483 to avoid conflicts in httpserver, then restore TTFB timing in logs

Actions #18

Updated by Tom Clegg over 6 years ago

restored TTFB timing without merge conflicts by adding a separate responseTimer wrapper around ResponseWriter. Could combine those two after merge.

12167-keep-request-id @ 4412d92e672d4df12ffc04c2b747b750e313cae9
  • keepstore logs incoming request ID if any, otherwise a generated one
  • keepproxy logs incoming request ID if any, otherwise a generated one, and propagates it to keepstore requests
  • keepproxy logs in JSON
Actions #19

Updated by Lucas Di Pentima over 6 years ago

Some questions/comments:

  • There are a couple of tests failing because of logging_router.go disappearance.
  • The story description requires an ID of ‘req-‘ + 20 random alphanum chars, should IDGenerator be updated to work like this? (if that’s a hard requirement)
  • Should these headers be truncated to some max amount of chars to avoid potentially malicious activity?
  • Some tests may be needed
Actions #20

Updated by Tom Clegg over 6 years ago

Lucas Di Pentima wrote:

  • There are a couple of tests failing because of logging_router.go disappearance.

Fixed keepstore tests.

  • The story description requires an ID of ‘req-‘ + 20 random alphanum chars, should IDGenerator be updated to work like this? (if that’s a hard requirement)

Indeed, timestamps are not very random, nor do they reach 20 chars in base 36. Fixed.

START: logger_test.go:28: Suite.TestLogRequests
map[string]interface {}{"reqMethod":"GET", "time":"2017-11-29T10:56:04.996722515-05:00", "RequestID":"req-e254kixtfyxeaeyzgf3b", "msg":"request", "reqBytes":0, "reqPath":"bar", "level":"info", "remoteAddr":"", "reqForwardedFor":"1.2.3.4:12345"}
map[string]interface {}{"level":"info", "remoteAddr":"", "timeToStatus":0.000227, "timeTotal":0.000246, "respBytes":11, "time":"2017-11-29T10:56:04.996982684-05:00", "msg":"response", "reqPath":"bar", "timeWriteBody":1.8e-05, "RequestID":"req-e254kixtfyxeaeyzgf3b", "reqBytes":0, "reqForwardedFor":"1.2.3.4:12345", "reqMethod":"GET", "respStatus":"", "respStatusCode":0}
PASS: logger_test.go:28: Suite.TestLogRequests  0.001s
  • Should these headers be truncated to some max amount of chars to avoid potentially malicious activity?

Go's http server has a MaxHeaderBytes limit, default 1MB. I guess this is as safe as anything else in the request that gets logged.

  • Some tests may be needed

Added a logging test.

12167-keep-request-id @ 20c3fbd046f726590774c5d7a1897835080a34e3

Actions #21

Updated by Lucas Di Pentima over 6 years ago

This lgtm, thanks!

Actions #22

Updated by Tom Clegg over 6 years ago

  • Description updated (diff)
Actions #23

Updated by Tom Clegg over 6 years ago

  • Target version changed from 2017-12-06 Sprint to 2017-12-20 Sprint
Actions #24

Updated by Tom Clegg over 6 years ago

12167-python-request-id @ b7eae78616dd72ddda597f414f3c208b099c0a27

$ arv-put --no-cache output.txt
2017-12-08 10:32:19 arvados.arv_put[19849] INFO: X-Request-Id: req-33mgkwu9wqvbqufrab5d
2017-12-08 10:32:19 arvados.arv_put[19849] INFO: No cache usage requested for this run.
0M / 0M 100.0% 2017-12-08 10:32:21 arvados.arv_put[19849] INFO:

2017-12-08 10:32:21 arvados.arv_put[19849] INFO: Collection saved as 'Saved at 2017-12-08 15:32:19 UTC by tom@nelle'
9tee4-4zz18-elcxwboktnhm5wc

grep API logs for req-33mgkwu9wqvbqufrab5d:

{"method":"GET","path":"/arvados/v1/users/current","format":"json","controller":"Arvados::V1::UsersController","action":"current","status":200,"duration":5.6,"view":0.35,"db":0.34,"request_id":"req-33mgkwu9wqvbqufrab5d","client_ipaddr":"24.67.204.71","client_auth":"9tee4-gj3su-55u9mmetngu39fd","params":{"alt":"json"},"@timestamp":"2017-12-08T15:32:19.621322066Z","@version":"1","message":"[200] GET /arvados/v1/users/current (Arvados::V1::UsersController#current)"}
{"method":"GET","path":"/arvados/v1/keep_services/accessible","format":"json","controller":"Arvados::V1::KeepServicesController","action":"accessible","status":200,"duration":6.18,"view":0.44,"db":1.81,"request_id":"req-33mgkwu9wqvbqufrab5d","client_ipaddr":"24.67.204.71","client_auth":"9tee4-gj3su-55u9mmetngu39fd","params":{"alt":"json"},"@timestamp":"2017-12-08T15:32:19.752575351Z","@version":"1","message":"[200] GET /arvados/v1/keep_services/accessible (Arvados::V1::KeepServicesController#accessible)"}
{"method":"POST","path":"/arvados/v1/collections","format":"json","controller":"Arvados::V1::CollectionsController","action":"create","status":200,"duration":41.96,"view":0.43,"db":25.88,"request_id":"req-33mgkwu9wqvbqufrab5d","client_ipaddr":"24.67.204.71","client_auth":"9tee4-gj3su-55u9mmetngu39fd","params":{"manifest_text":". 875ec5b819be5be78bf7ac2eb9ba1417+18+A6a9cc649be1699084ad83d2997a9d4d723a733a8@5a3d271b 0:18:output.txt\n","owner_uuid":"9tee4-tpzed-rneoswozqjk5tcv","name":"Saved at 2017-12-08 15:32:19 UTC by tom@nelle","replication_desired":null,"ensure_unique_name":"true","alt":"json","collection":{"owner_uuid":"9tee4-tpzed-rneoswozqjk5tcv","replication_desired":null,"manifest_text":". 875ec5b819be5be78bf7ac2eb9ba1417+18+A6a9cc649be1699084ad83d2997a9d4d723a733a8@5a3d271b 0:18:output.txt\n","name":"Saved at 2017-12-08 15:32:19 UTC by tom@nelle"}},"@timestamp":"2017-12-08T15:32:21.061234728Z","@version":"1","message":"[200] POST /arvados/v1/collections (Arvados::V1::CollectionsController#create)"}

keep0 keepstore:

2017-12-08_15:34:06.77212 {"RequestID":"req-33mgkwu9wqvbqufrab5d","level":"info","msg":"request","remoteAddr":"10.100.32.7:38904","reqBytes":18,"reqForwardedFor":"","reqMethod":"PUT","reqPath":"875ec5b819be5be78bf7ac2eb9ba1417","time":"2017-12-08T15:34:06.771978647Z"}
2017-12-08_15:34:06.84978 {"RequestID":"req-33mgkwu9wqvbqufrab5d","level":"info","msg":"response","remoteAddr":"10.100.32.7:38904","reqBytes":18,"reqForwardedFor":"","reqMethod":"PUT","reqPath":"875ec5b819be5be78bf7ac2eb9ba1417","respBytes":87,"respStatus":"","respStatusCode":0,"time":"2017-12-08T15:34:06.849590663Z","timeToStatus":0.077579,"timeTotal":0.077599,"timeWriteBody":0.000020}

keep1 keepstore:

2017-12-08_15:39:07.37169 {"RequestID":"req-33mgkwu9wqvbqufrab5d","level":"info","msg":"request","remoteAddr":"10.100.32.7:55000","reqBytes":18,"reqForwardedFor":"","reqMethod":"PUT","reqPath":"875ec5b819be5be78bf7ac2eb9ba1417","time":"2017-12-08T15:39:07.371533438Z"}
2017-12-08_15:39:07.45356 {"RequestID":"req-33mgkwu9wqvbqufrab5d","level":"info","msg":"response","remoteAddr":"10.100.32.7:55000","reqBytes":18,"reqForwardedFor":"","reqMethod":"PUT","reqPath":"875ec5b819be5be78bf7ac2eb9ba1417","respBytes":87,"respStatus":"","respStatusCode":0,"time":"2017-12-08T15:39:07.453427752Z","timeToStatus":0.081692,"timeTotal":0.081911,"timeWriteBody":0.000219}

keepproxy:

2017-12-08_15:34:06.72716 {"RequestID":"req-33mgkwu9wqvbqufrab5d","level":"info","msg":"request","remoteAddr":"127.0.0.1:33230","reqBytes":18,"reqForwardedFor":"24.67.204.71","reqMethod":"PUT","reqPath":"875ec5b819be5be78bf7ac2eb9ba1417","time":"2017-12-08T15:34:06.727034881Z"}
2017-12-08_15:34:06.85703 {"RequestID":"req-33mgkwu9wqvbqufrab5d","level":"info","msg":"response","remoteAddr":"127.0.0.1:33230","reqBytes":18,"reqForwardedFor":"24.67.204.71","reqMethod":"PUT","reqPath":"875ec5b819be5be78bf7ac2eb9ba1417","respBytes":86,"respStatus":"","respStatusCode":0,"time":"2017-12-08T15:34:06.856949973Z","timeToStatus":0.129809,"timeTotal":0.129891,"timeWriteBody":0.000082}
Actions #25

Updated by Tom Clegg over 6 years ago

Added tests.

12167-python-request-id @ 4453022a516e2b1deb30a71d8ee811d6593c44c3

Actions #26

Updated by Lucas Di Pentima over 6 years ago

Just a couple of comments:

  • Shouldn't the KeepXRequestId class be called KeepXRequestIdTestCase just for code style purposes?
  • Wouldn’t be convenient to add a couple of tests to confirm that arv-get & arv-put select a properly formatted req-id (just in case we change the arvados.util.new_request_id() call to something else in the future)?

Apart from that, LGTM.

Actions #27

Updated by Tom Clegg over 6 years ago

12167-python-request-id @ 34105f94a3b9a93d3efd1452c1e5e9643a4ed1dc
  • added arv-get and arv-put tests
  • renamed KeepXRequestId to KeepXRequestIdTestCase
Actions #28

Updated by Lucas Di Pentima over 6 years ago

LGTM. Thanks!

Actions #29

Updated by Tom Clegg over 6 years ago

  • Related to Idea #12512: [Workbench] Implement X-Request-ID to enable request tracing added
Actions #30

Updated by Tom Clegg over 6 years ago

  • Description updated (diff)
Actions #31

Updated by Tom Clegg over 6 years ago

  • Description updated (diff)
Actions #32

Updated by Tom Morris over 6 years ago

  • Target version changed from 2017-12-20 Sprint to 2018-01-17 Sprint
Actions #33

Updated by Tom Clegg over 6 years ago

  • Subject changed from [SDKs] [API] [keepstore] [keepproxy] Facilitate request tracing for arv-put to [SDKs] [API] [keepstore] [keepproxy] Facilitate request tracing for all services
  • Story points changed from 2.0 to 1.0
Actions #34

Updated by Tom Morris over 6 years ago

  • Target version changed from 2018-01-17 Sprint to 2018-01-31 Sprint
Actions #35

Updated by Tom Clegg over 6 years ago

  • Target version changed from 2018-01-31 Sprint to Arvados Future Sprints
Actions #36

Updated by Tom Morris about 6 years ago

  • Assigned To deleted (Tom Clegg)
  • Target version changed from Arvados Future Sprints to 2018-03-14 Sprint
Actions #37

Updated by Tom Morris about 6 years ago

  • Related to Idea #13147: [arv-put] Add request ID to requests for tracability added
Actions #38

Updated by Tom Morris about 6 years ago

  • Assigned To set to Lucas Di Pentima
Actions #39

Updated by Lucas Di Pentima about 6 years ago

Pending tasks

  • Right now arv-put is logging the request-id on the first error message log, confirm if this is enough or should be logged on every error message.
    • Also, arv-put isn't logging the request id when handling SIGINT.
  • Ruby client support.
  • Keep-web with related Go SDK updates.
  • Confirm if keepproxy is sending the X-Request-Id when making API requests. (as it's not marked as "done" on the description)
Actions #40

Updated by Lucas Di Pentima about 6 years ago

  • Target version changed from 2018-03-14 Sprint to 2018-03-28 Sprint
Actions #41

Updated by Lucas Di Pentima about 6 years ago

Branch 12167-arvput-log-on-signals - 247f218e49da73cfcd5b27a4990f16e3a3231ddd
Test run: https://ci.curoverse.com/job/developer-run-tests/657/

arv-put logs the request id before exiting because of SIGINT, SIGTERM & SIGQUIT.

Ran the test locally 100 times and didn't got any failure, so although it uses time.sleep(), I think it's pretty safe. Instead of waiting 1 second before sending the signal, I tried to capture some stdout from the process prior to sending it but I always got no output from pipe.stdout.read(), even if I waited several secs.

Actions #42

Updated by Tom Clegg about 6 years ago

select.select(..., 10) seems to work reliably in place of time.sleep(1).

I also replaced "yes" with "sleep 10". With "yes", failing tests seemed to result in a python process trying to OOM-kill the host.

12167-arvput-log-on-signals @ ea8013a2d0cd0faeeec132f5d950fb3a5461d6ca

If that looks good to you, then LGTM. Thanks.

Actions #43

Updated by Lucas Di Pentima about 6 years ago

Very cool idea, thanks! I'll merge to master.

Actions #44

Updated by Lucas Di Pentima about 6 years ago

  • Target version changed from 2018-03-28 Sprint to 2018-04-11 Sprint
Actions #45

Updated by Lucas Di Pentima about 6 years ago

  • Target version changed from 2018-04-11 Sprint to 2018-04-25 Sprint
Actions #46

Updated by Tom Clegg about 6 years ago

  • Assigned To changed from Lucas Di Pentima to Tom Clegg
Actions #47

Updated by Tom Clegg about 6 years ago

  • Target version changed from 2018-04-25 Sprint to 2018-05-09 Sprint
Actions #48

Updated by Tom Clegg almost 6 years ago

12167-go-request-id @ 7d96319e476c866f33f119dddd534fa6cbe482fb
  • keep-web logs requests and responses as JSON via logrus (catching up with keepstore, keepproxy, ws)
  • keep-web logs the incoming request-id, if any (otherwise a newly generated one)
  • keep-web propagates request-id in outgoing API and keep requests
  • keepproxy propagates request-id in outgoing API (which it didn't before) and keep requests (which it did before using its custom http client, but now does with the new sdk/go/keepclient facility)

https://ci.curoverse.com/job/developer-run-tests/700/

Actions #49

Updated by Lucas Di Pentima almost 6 years ago

Some comments/questions:

  • Some tests on services/keep-web are failing (cadaver_test.go file) on my local run. On Jenkins it seems that the 'tests-remainder' job was cancelled
  • File sdk/go/arvados/client.go
    • Lines 109-112: I think the new reqid retrieval/generation could be moved inside the conditional at line 114 to avoid unnecessary work
    • Line 253: What would happen if reqid is an empty string? I believe it could be the case when an older client doesn’t provide the appropriate X-Request-Id header, for example at services/keep-web/handler.go:431
  • File sdk/go/keepclient/keepclient.go
    • Line 551: Should kc.RequestID be assigned with the newly generated Id also, to avoid a new id on every setRequestHeaders call?
  • Do you think adding some tests would be convenient?
Actions #50

Updated by Tom Clegg almost 6 years ago

Lucas Di Pentima wrote:

  • Some tests on services/keep-web are failing (cadaver_test.go file) on my local run. On Jenkins it seems that the 'tests-remainder' job was cancelled

Could be the $HOME problem we fixed in #13111 by updating an old (<1.9) Go runtime. I've merged master here, so run-tests.sh should demand Go 1.10 now.

  • File sdk/go/arvados/client.go
    • Lines 109-112: I think the new reqid retrieval/generation could be moved inside the conditional at line 114 to avoid unnecessary work

Indeed, fixed.

  • Line 253: What would happen if reqid is an empty string? I believe it could be the case when an older client doesn’t provide the appropriate X-Request-Id header, for example at services/keep-web/handler.go:431

In that specific case the header has been set by httpserver.AddRequestIDs() but I've updated client.go so WithRequestID("") causes a new ID to be generated for each request.

  • File sdk/go/keepclient/keepclient.go
    • Line 551: Should kc.RequestID be assigned with the newly generated Id also, to avoid a new id on every setRequestHeaders call?

Ah yes, this was generating a new request ID for each outgoing request to service a single Get or Put. If the caller hasn't set a request ID on the client object, I think what we want is a new ID for each call to Get, Put, etc. I've updated accordingly.

  • Do you think adding some tests would be convenient?

Yes! Done.

12167-go-request-id @ da570c74ba7949b5e5cc89cfa2db2af603890137

Actions #51

Updated by Lucas Di Pentima almost 6 years ago

This LGTM, thanks!

Actions #52

Updated by Tom Clegg almost 6 years ago

  • Status changed from In Progress to Resolved
Actions #53

Updated by Tom Morris almost 6 years ago

  • Release set to 13
Actions

Also available in: Atom PDF