Bug #11546

Expensive container lock/unlock

Added by Peter Amstutz 7 months ago. Updated 6 months ago.

Status:ResolvedStart date:05/17/2017
Priority:NormalDue date:
Assignee:Tom Clegg% Done:

100%

Category:-
Target version:2017-05-24 sprint
Story points-Remaining (hours)0.00 hour
Velocity based estimate-

Description

Currently the "lock" and "unlock" operations aquire a table lock. This causes ActiveRecord to reload the record. If loading the record is expensive (eg #11547) this means the lock can be held for a long time, which effectively paralyzes the containers table.

Change lock to do:

"update container set state=Locked where uuid=X and state=Queued"

and unlock:

"update container set state=Queued where uuid=X and state=Locked"

and use the number of rows updated (1 or 0) to determine success.

#<ArvadosModel::InvalidStateTransitionError: ArvadosModel::InvalidStateTransitionError>
/var/www/arvados-api/current/app/models/container.rb:224:in `block in unlock'
/var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-3.2.22.5/lib/active_record/locking/pessimistic.rb:72:in `block in with_lock'
/var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-3.2.22.5/lib/active_record/connection_adapters/abstract/database_statements.rb:192:in `transaction'
/var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-3.2.22.5/lib/active_record/transactions.rb:208:in `transaction'
/var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-3.2.22.5/lib/active_record/transactions.rb:250:in `transaction'
/var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-3.2.22.5/lib/active_record/locking/pessimistic.rb:70:in `with_lock'
/var/www/arvados-api/current/app/models/container.rb:222:in `unlock'
/var/www/arvados-api/current/app/controllers/arvados/v1/containers_controller.rb:33:in `unlock'
...
Error 1493151643+9496f4a1: 422
{"method":"POST","path":"/arvados/v1/containers/qr1hi-dz642-wu9psm8wa0fnxnf/unlock","format":"html","controller":"arvados/v1/containers","action":"unlock","status":422,"duration":88029.34,"view":0.58,"db":1109.73,"params":{},"@timestamp":"2017-04-25T20:20:43.942319294Z","@version":"1","message":"[422] POST /arvados/v1/containers/qr1hi-dz642-wu9psm8wa0fnxnf/unlock (arvados/v1/containers#unlock)"}

Duration of this request is 88 seconds!


Subtasks

Task #11587: Review 11546-fast-lockResolvedTom Clegg


Related issues

Related to Arvados - Bug #11547: [API] Excessive memory and time consumed by API and dispa... New
Related to Arvados - Bug #11549: [CWL] Avoid mounting individual files Resolved 04/27/2017

Associated revisions

Revision b3f88934
Added by Tom Clegg 6 months ago

Merge branch '11546-fast-lock'

closes #11546

History

#1 Updated by Peter Amstutz 7 months ago

  • Description updated (diff)

#2 Updated by Peter Amstutz 7 months ago

The first unlock operation (this one doesn't fail):

{"method":"POST","path":"/arvados/v1/containers/qr1hi-dz642-itcctbtq2z1lb8t/unlock","format":"html","controller":"arvados/v1/containers","action":"unlock","status":200,"duration":146246.43,"view":0.46,"db":8538.25,"params":{},"@timestamp":"2017-04-25T18:34:49.242027342Z","@version":"1","message":"[200] POST /arvados/v1/containers/qr1hi-dz642-itcctbtq2z1lb8t/unlock (arvados/v1/containers#unlock)"}

#3 Updated by Peter Amstutz 7 months ago

  • Subject changed from Expensive unlock operation? to Expensive container lock/unlock

The lock operation also takes a really long time???

{"method":"POST","path":"/arvados/v1/containers/qr1hi-dz642-itcctbtq2z1lb8t/lock","format":"html","controller":"arvados/v1/containers","action":"lock","status":200,"duration":201764.38,"view":8.28,"db":15356.04,"params":{},"@timestamp":"2017-04-25T18:32:22.443079042Z","@version":"1","message":"[200] POST /arvados/v1/containers/qr1hi-dz642-itcctbtq2z1lb8t/lock (arvados/v1/containers#lock)"}

#4 Updated by Peter Amstutz 7 months ago

Just doing a GET on the record takes a long time:

{"method":"GET","path":"/arvados/v1/containers/qr1hi-dz642-itcctbtq2z1lb8t","format":"html","controller":"arvados/v1/containers","action":"show","status":200,"duration":44999.11,"view":0.43,"db":598.06,"params":
{"select":"[\"uuid\"]"},"@timestamp":"2017-04-25T18:29:18.760545042Z","@version":"1","message":"[200] GET /arvados/v1/containers/qr1hi-dz642-itcctbtq2z1lb8t (arvados/v1/containers#show)"}

#5 Updated by Peter Amstutz 7 months ago

The details for this are on #11547

#6 Updated by Peter Amstutz 7 months ago

  • Description updated (diff)
  • Status changed from New to Duplicate

#7 Updated by Peter Amstutz 7 months ago

  • Description updated (diff)

#8 Updated by Peter Amstutz 7 months ago

  • Description updated (diff)
  • Status changed from Duplicate to New

#9 Updated by Nico C├ęsar 7 months ago

as per https://github.com/rails/rails/issues/26570 suggestion

arvados_production=# select * from pg_locks where 
    granted and relation = 'containers'::regclass \x\g\x
Expanded display is on.
-[ RECORD 1 ]------+-------------
locktype           | relation
database           | 7878680
relation           | 20204982
page               | 
tuple              | 
virtualxid         | 
transactionid      | 
classid            | 
objid              | 
objsubid           | 
virtualtransaction | 21/29602
pid                | 62426
mode               | RowShareLock
granted            | t
fastpath           | t

Expanded display is off.
arvados_production=# select * from pg_stat_activity where pid=62426 \x\g\x
Expanded display is on.
-[ RECORD 1 ]----+------------------------------------------------------------------------------------------
datid            | 7878680
datname          | arvados_production
pid              | 62426
usesysid         | 16385
usename          | arvados
application_name | Passenger RubyApp: /var/www/qr1hi.arvadosapi.com/current/public
client_addr      | ::1
client_hostname  | 
client_port      | 33714
backend_start    | 2017-04-27 18:39:16.540269+00
xact_start       | 2017-04-27 19:29:32.712103+00
query_start      | 2017-04-27 19:29:32.713118+00
state_change     | 2017-04-27 19:29:33.128494+00
waiting          | f
state            | idle in transaction
query            | SELECT  "containers".* FROM "containers"  WHERE "containers"."id" = $1 LIMIT 1 FOR UPDATE

Expanded display is off.

#10 Updated by Peter Amstutz 7 months ago

  • Status changed from New to In Progress
  • Assignee set to Peter Amstutz
  • Target version set to 2017-05-10 sprint

#11 Updated by Peter Amstutz 6 months ago

  • Target version changed from 2017-05-10 sprint to 2017-05-24 sprint

#12 Updated by Peter Amstutz 6 months ago

ActiveRecord with_lock is implemented using "SELECT FOR UPDATE" to acquire a row level write lock for duration of transaction. Because it is implemented as a SELECT it also does an unscoped reload of the record.

#13 Updated by Peter Amstutz 6 months ago

Using a direct row update is awkward because it bypasses all the normal ActiveRecord machinery: permission_to_update, validate_lock, assign_auth, update_modified_by_fields and probably others have to be called explicitly (or equivalent behavior added to the lock/unlock methods).

It seems like it probably works to load and save a partial record using Container.select(:uuid, :state).where(uuid: blah)

The workaround might be to load the partial record, call selected methods, then instead of using save to perform the update directly. Maybe there is a way to get the update statement that would be use by save!?

#14 Updated by Peter Amstutz 6 months ago

  • Assignee changed from Peter Amstutz to Tom Clegg

#15 Updated by Tom Clegg 6 months ago

11546-fast-lock @ f3d300ce629f1c17ccc83bf1978ff5c7dd41e33d

Still needs some tests, but does this seem like a decent approach?

The locking SQL looks like this; note no "mounts" mentioned at all in the log message.

   (0.2ms)  SELECT clock_timestamp()
   (0.5ms)  UPDATE containers SET state='Locked', auth_uuid='zzzzz-gj3su-15qbgtwcs4qxqy2', locked_by_uuid='zzzzz-gj3su-k9dvestay1plssr', modified_at='2017-05-17T14:25:08.670955000
Z' WHERE uuid='zzzzz-dz642-queuedcontainer' AND state='Queued' and priority>0
   (0.2ms)  SELECT clock_timestamp()
  SQL (0.5ms)  INSERT INTO "logs" ("properties", "event_type", "object_uuid", "object_owner_uuid", "summary", "event_at", "owner_uuid", "created_at", "updated_at", "modified_at", 
"modified_by_user_uuid", "modified_by_client_uuid", "uuid") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13) RETURNING "id"  [["properties", "{\"old_etag\":\"bprchq
sgg9qcnyi0zqqzbhs2d\",\"old_attributes\":{\"id\":202999634,\"uuid\":\"zzzzz-dz642-queuedcontainer\",\"owner_uuid\":\"zzzzz-tpzed-000000000000000\",\"modified_at\":null,\"modified_
by_user_uuid\":null,\"state\":\"Queued\",\"priority\":1,\"updated_at\":\"2016-01-11T11:11:11.111111000Z\"},\"new_etag\":\"b3myd6e85bmm2ksouva471zw4\",\"new_attributes\":{\"id\":20
2999634,\"uuid\":\"zzzzz-dz642-queuedcontainer\",\"owner_uuid\":\"zzzzz-tpzed-000000000000000\",\"modified_at\":\"2017-05-17T14:25:08.670955000Z\",\"modified_by_user_uuid\":\"zzzz
z-tpzed-000000000000000\",\"state\":\"Locked\",\"priority\":1,\"updated_at\":\"2017-05-17T14:25:08.670955000Z\",\"auth_uuid\":\"zzzzz-gj3su-15qbgtwcs4qxqy2\",\"locked_by_uuid\":\" 
zzzzz-gj3su-k9dvestay1plssr\"}}"], ["event_type", "update"], ["object_uuid", "zzzzz-dz642-queuedcontainer"], ["object_owner_uuid", "zzzzz-tpzed-000000000000000"], ["summary", "upd
ate of zzzzz-dz642-queuedcontainer"], ["event_at", "2017-05-17 14:25:08.670955"], ["owner_uuid", "zzzzz-tpzed-000000000000000"], ["created_at", "2017-05-17 14:25:08.673620"], ["up
dated_at", "2017-05-17 14:25:08.673620"], ["modified_at", "2017-05-17 14:25:08.673620"], ["modified_by_user_uuid", "zzzzz-tpzed-000000000000000"], ["modified_by_client_uuid", "zzz
zz-ozdt8-obw7foaks3qjyej"], ["uuid", "zzzzz-57u5n-ollmb661jukx2ix"]]
   (0.2ms)  NOTIFY logs, '1315'

I didn't bother giving "unlock" the same treatment yet (should I?). I did add a couple of checks so at least we don't do the expensive "select * for update" operation before even checking the obvious non-race error cases.

#16 Updated by Tom Clegg 6 months ago

Another approach would be to go back to the obvious Rails locking mechanism, but use "select for share nowait".

https://www.postgresql.org/docs/9.1/static/sql-select.html#SQL-FOR-UPDATE-SHARE

This way, when there are many concurrent lock, get, and update calls in progress:
  • most lock calls would fail immediately due to "nowait"
  • at most one lock call would obtain the row lock and proceed (albeit still slowly, because Rails wants to reload the entire record)
  • all get calls would read the unlocked record and proceed, even though the lock transaction is in progress (functionally this seems fine since nobody has been told that a lock was obtained)
  • all update calls would wait for the successful lock call (if any), and one another, to finish before proceeding, as usual

#17 Updated by Tom Clegg 6 months ago

11546-fast-lock @ fbc576b76977938cf7b742f9770ab90559136dc8
  • threw away the "circumvent AR update" code
  • used "for update nowait" (not "for share nowait" -- that doesn't block other "for share nowait" until we actually run "update") -- this means when multiple callers try to lock, at most one of them will do an expensive AR reload; the others will fail immediately (instead of waiting in line and doing their own expensive AR reloads only to discover someone else had taken the lock, which was the previous behavior)
  • kept the "don't load irrelevant fields before getting the lock" bit
  • added "don't return irrelevant fields in lock/unlock responses"
  Container Load (0.5ms)  SELECT  "containers"."id", "containers"."uuid", "containers"."state", "containers"."priority", "containers"."auth_uuid", "containers"."locked_by_uuid" FROM "containers" WHERE (containers.uuid='zzzzz-dz642-queuedcontainer')  ORDER BY "containers"."id" ASC LIMIT 1 OFFSET 0
  Container Load (0.5ms)  SELECT  "containers".* FROM "containers" WHERE "containers"."id" = $1 LIMIT 1 FOR UPDATE NOWAIT  [["id", 202999634]]
   (0.2ms)  SAVEPOINT active_record_1

#18 Updated by Peter Amstutz 6 months ago

I think maybe you need a transaction? Looks like normally AR starts its transaction only once you call save and it starts running through its hooks and stuff.

See http://api.rubyonrails.org/v4.2/classes/ActiveRecord/Locking/Pessimistic.html#method-i-with_lock when using with_lock it wraps it in a transaction.

  transaction do
    begin
      reload(lock: 'FOR UPDATE NOWAIT')
    rescue
      raise LockFailedError.new("cannot lock: other transaction in progress")
    end
    check_lock_fail
    update_attributes!(state: Locked)
  end

#19 Updated by Tom Clegg 6 months ago

#20 Updated by Peter Amstutz 6 months ago

So an actual idiomatic way of writing this would be:

begin
  with_lock (lock: 'FOR UPDATE NOWAIT') do
    check_lock_fail
    update_attributes!(state: Locked)
  end
rescue => whatever error you get from a lock conflict
  raise LockFailedError.new("cannot lock: other transaction in progress")
end

Except for the part where we need to know which exception type to catch on lock conflict.

#21 Updated by Peter Amstutz 6 months ago

LGTM 11546-fast-lock @ 78390940029ab08efecab923615c545b9da922b4

#22 Updated by Tom Clegg 6 months ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:b3f88934721fc97973b40786cae97e55e84b31d5.

Also available in: Atom PDF