Bug #11546
closedExpensive container lock/unlock
Description
Currently the "lock" and "unlock" operations aquire a table lock. This causes ActiveRecord to reload the record. If loading the record is expensive (eg #11547) this means the lock can be held for a long time, which effectively paralyzes the containers table.
Change lock to do:
"update container set state=Locked where uuid=X and state=Queued"
and unlock:
"update container set state=Queued where uuid=X and state=Locked"
and use the number of rows updated (1 or 0) to determine success.
#<ArvadosModel::InvalidStateTransitionError: ArvadosModel::InvalidStateTransitionError> /var/www/arvados-api/current/app/models/container.rb:224:in `block in unlock' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-3.2.22.5/lib/active_record/locking/pessimistic.rb:72:in `block in with_lock' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-3.2.22.5/lib/active_record/connection_adapters/abstract/database_statements.rb:192:in `transaction' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-3.2.22.5/lib/active_record/transactions.rb:208:in `transaction' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-3.2.22.5/lib/active_record/transactions.rb:250:in `transaction' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-3.2.22.5/lib/active_record/locking/pessimistic.rb:70:in `with_lock' /var/www/arvados-api/current/app/models/container.rb:222:in `unlock' /var/www/arvados-api/current/app/controllers/arvados/v1/containers_controller.rb:33:in `unlock' ... Error 1493151643+9496f4a1: 422 {"method":"POST","path":"/arvados/v1/containers/qr1hi-dz642-wu9psm8wa0fnxnf/unlock","format":"html","controller":"arvados/v1/containers","action":"unlock","status":422,"duration":88029.34,"view":0.58,"db":1109.73,"params":{},"@timestamp":"2017-04-25T20:20:43.942319294Z","@version":"1","message":"[422] POST /arvados/v1/containers/qr1hi-dz642-wu9psm8wa0fnxnf/unlock (arvados/v1/containers#unlock)"}
Duration of this request is 88 seconds!
Updated by Peter Amstutz over 7 years ago
The first unlock operation (this one doesn't fail):
{"method":"POST","path":"/arvados/v1/containers/qr1hi-dz642-itcctbtq2z1lb8t/unlock","format":"html","controller":"arvados/v1/containers","action":"unlock","status":200,"duration":146246.43,"view":0.46,"db":8538.25,"params":{},"@timestamp":"2017-04-25T18:34:49.242027342Z","@version":"1","message":"[200] POST /arvados/v1/containers/qr1hi-dz642-itcctbtq2z1lb8t/unlock (arvados/v1/containers#unlock)"}
Updated by Peter Amstutz over 7 years ago
- Subject changed from Expensive unlock operation? to Expensive container lock/unlock
The lock operation also takes a really long time???
{"method":"POST","path":"/arvados/v1/containers/qr1hi-dz642-itcctbtq2z1lb8t/lock","format":"html","controller":"arvados/v1/containers","action":"lock","status":200,"duration":201764.38,"view":8.28,"db":15356.04,"params":{},"@timestamp":"2017-04-25T18:32:22.443079042Z","@version":"1","message":"[200] POST /arvados/v1/containers/qr1hi-dz642-itcctbtq2z1lb8t/lock (arvados/v1/containers#lock)"}
Updated by Peter Amstutz over 7 years ago
Just doing a GET on the record takes a long time:
{"method":"GET","path":"/arvados/v1/containers/qr1hi-dz642-itcctbtq2z1lb8t","format":"html","controller":"arvados/v1/containers","action":"show","status":200,"duration":44999.11,"view":0.43,"db":598.06,"params": {"select":"[\"uuid\"]"},"@timestamp":"2017-04-25T18:29:18.760545042Z","@version":"1","message":"[200] GET /arvados/v1/containers/qr1hi-dz642-itcctbtq2z1lb8t (arvados/v1/containers#show)"}
Updated by Peter Amstutz over 7 years ago
- Description updated (diff)
- Status changed from New to Duplicate
Updated by Peter Amstutz over 7 years ago
- Description updated (diff)
- Status changed from Duplicate to New
Updated by Nico César over 7 years ago
as per https://github.com/rails/rails/issues/26570 suggestion
arvados_production=# select * from pg_locks where granted and relation = 'containers'::regclass \x\g\x Expanded display is on. -[ RECORD 1 ]------+------------- locktype | relation database | 7878680 relation | 20204982 page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | 21/29602 pid | 62426 mode | RowShareLock granted | t fastpath | t Expanded display is off. arvados_production=# select * from pg_stat_activity where pid=62426 \x\g\x Expanded display is on. -[ RECORD 1 ]----+------------------------------------------------------------------------------------------ datid | 7878680 datname | arvados_production pid | 62426 usesysid | 16385 usename | arvados application_name | Passenger RubyApp: /var/www/qr1hi.arvadosapi.com/current/public client_addr | ::1 client_hostname | client_port | 33714 backend_start | 2017-04-27 18:39:16.540269+00 xact_start | 2017-04-27 19:29:32.712103+00 query_start | 2017-04-27 19:29:32.713118+00 state_change | 2017-04-27 19:29:33.128494+00 waiting | f state | idle in transaction query | SELECT "containers".* FROM "containers" WHERE "containers"."id" = $1 LIMIT 1 FOR UPDATE Expanded display is off.
Updated by Peter Amstutz over 7 years ago
- Status changed from New to In Progress
- Assigned To set to Peter Amstutz
- Target version set to 2017-05-10 sprint
Updated by Peter Amstutz over 7 years ago
- Target version changed from 2017-05-10 sprint to 2017-05-24 sprint
Updated by Peter Amstutz over 7 years ago
ActiveRecord with_lock
is implemented using "SELECT FOR UPDATE" to acquire a row level write lock for duration of transaction. Because it is implemented as a SELECT it also does an unscoped reload of the record.
Updated by Peter Amstutz over 7 years ago
Using a direct row update is awkward because it bypasses all the normal ActiveRecord machinery: permission_to_update, validate_lock, assign_auth, update_modified_by_fields and probably others have to be called explicitly (or equivalent behavior added to the lock/unlock methods).
It seems like it probably works to load and save a partial record using Container.select(:uuid, :state).where(uuid: blah)
The workaround might be to load the partial record, call selected methods, then instead of using save
to perform the update directly. Maybe there is a way to get the update statement that would be use by save!
?
Updated by Peter Amstutz over 7 years ago
- Assigned To changed from Peter Amstutz to Tom Clegg
Updated by Tom Clegg over 7 years ago
11546-fast-lock @ f3d300ce629f1c17ccc83bf1978ff5c7dd41e33d
Still needs some tests, but does this seem like a decent approach?
The locking SQL looks like this; note no "mounts" mentioned at all in the log message.
(0.2ms) SELECT clock_timestamp() (0.5ms) UPDATE containers SET state='Locked', auth_uuid='zzzzz-gj3su-15qbgtwcs4qxqy2', locked_by_uuid='zzzzz-gj3su-k9dvestay1plssr', modified_at='2017-05-17T14:25:08.670955000 Z' WHERE uuid='zzzzz-dz642-queuedcontainer' AND state='Queued' and priority>0 (0.2ms) SELECT clock_timestamp() SQL (0.5ms) INSERT INTO "logs" ("properties", "event_type", "object_uuid", "object_owner_uuid", "summary", "event_at", "owner_uuid", "created_at", "updated_at", "modified_at", "modified_by_user_uuid", "modified_by_client_uuid", "uuid") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13) RETURNING "id" [["properties", "{\"old_etag\":\"bprchq sgg9qcnyi0zqqzbhs2d\",\"old_attributes\":{\"id\":202999634,\"uuid\":\"zzzzz-dz642-queuedcontainer\",\"owner_uuid\":\"zzzzz-tpzed-000000000000000\",\"modified_at\":null,\"modified_ by_user_uuid\":null,\"state\":\"Queued\",\"priority\":1,\"updated_at\":\"2016-01-11T11:11:11.111111000Z\"},\"new_etag\":\"b3myd6e85bmm2ksouva471zw4\",\"new_attributes\":{\"id\":20 2999634,\"uuid\":\"zzzzz-dz642-queuedcontainer\",\"owner_uuid\":\"zzzzz-tpzed-000000000000000\",\"modified_at\":\"2017-05-17T14:25:08.670955000Z\",\"modified_by_user_uuid\":\"zzzz z-tpzed-000000000000000\",\"state\":\"Locked\",\"priority\":1,\"updated_at\":\"2017-05-17T14:25:08.670955000Z\",\"auth_uuid\":\"zzzzz-gj3su-15qbgtwcs4qxqy2\",\"locked_by_uuid\":\" zzzzz-gj3su-k9dvestay1plssr\"}}"], ["event_type", "update"], ["object_uuid", "zzzzz-dz642-queuedcontainer"], ["object_owner_uuid", "zzzzz-tpzed-000000000000000"], ["summary", "upd ate of zzzzz-dz642-queuedcontainer"], ["event_at", "2017-05-17 14:25:08.670955"], ["owner_uuid", "zzzzz-tpzed-000000000000000"], ["created_at", "2017-05-17 14:25:08.673620"], ["up dated_at", "2017-05-17 14:25:08.673620"], ["modified_at", "2017-05-17 14:25:08.673620"], ["modified_by_user_uuid", "zzzzz-tpzed-000000000000000"], ["modified_by_client_uuid", "zzz zz-ozdt8-obw7foaks3qjyej"], ["uuid", "zzzzz-57u5n-ollmb661jukx2ix"]] (0.2ms) NOTIFY logs, '1315'
I didn't bother giving "unlock" the same treatment yet (should I?). I did add a couple of checks so at least we don't do the expensive "select * for update" operation before even checking the obvious non-race error cases.
Updated by Tom Clegg over 7 years ago
Another approach would be to go back to the obvious Rails locking mechanism, but use "select for share nowait".
https://www.postgresql.org/docs/9.1/static/sql-select.html#SQL-FOR-UPDATE-SHARE
This way, when there are many concurrent lock, get, and update calls in progress:- most lock calls would fail immediately due to "nowait"
- at most one lock call would obtain the row lock and proceed (albeit still slowly, because Rails wants to reload the entire record)
- all get calls would read the unlocked record and proceed, even though the lock transaction is in progress (functionally this seems fine since nobody has been told that a lock was obtained)
- all update calls would wait for the successful lock call (if any), and one another, to finish before proceeding, as usual
Updated by Tom Clegg over 7 years ago
- threw away the "circumvent AR update" code
- used "for update nowait" (not "for share nowait" -- that doesn't block other "for share nowait" until we actually run "update") -- this means when multiple callers try to lock, at most one of them will do an expensive AR reload; the others will fail immediately (instead of waiting in line and doing their own expensive AR reloads only to discover someone else had taken the lock, which was the previous behavior)
- kept the "don't load irrelevant fields before getting the lock" bit
- added "don't return irrelevant fields in lock/unlock responses"
Container Load (0.5ms) SELECT "containers"."id", "containers"."uuid", "containers"."state", "containers"."priority", "containers"."auth_uuid", "containers"."locked_by_uuid" FROM "containers" WHERE (containers.uuid='zzzzz-dz642-queuedcontainer') ORDER BY "containers"."id" ASC LIMIT 1 OFFSET 0 Container Load (0.5ms) SELECT "containers".* FROM "containers" WHERE "containers"."id" = $1 LIMIT 1 FOR UPDATE NOWAIT [["id", 202999634]] (0.2ms) SAVEPOINT active_record_1
Updated by Peter Amstutz over 7 years ago
I think maybe you need a transaction? Looks like normally AR starts its transaction only once you call save and it starts running through its hooks and stuff.
See http://api.rubyonrails.org/v4.2/classes/ActiveRecord/Locking/Pessimistic.html#method-i-with_lock when using with_lock it wraps it in a transaction.
transaction do begin reload(lock: 'FOR UPDATE NOWAIT') rescue raise LockFailedError.new("cannot lock: other transaction in progress") end check_lock_fail update_attributes!(state: Locked) end
Updated by Tom Clegg over 7 years ago
indeed
11546-fast-lock @ 78390940029ab08efecab923615c545b9da922b4
Updated by Peter Amstutz over 7 years ago
So an actual idiomatic way of writing this would be:
begin with_lock (lock: 'FOR UPDATE NOWAIT') do check_lock_fail update_attributes!(state: Locked) end rescue => whatever error you get from a lock conflict raise LockFailedError.new("cannot lock: other transaction in progress") end
Except for the part where we need to know which exception type to catch on lock conflict.
Updated by Peter Amstutz over 7 years ago
LGTM 11546-fast-lock @ 78390940029ab08efecab923615c545b9da922b4
Updated by Tom Clegg over 7 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:b3f88934721fc97973b40786cae97e55e84b31d5.