Project

General

Profile

Actions

Bug #22158

closed

Reported possible issue with postinst migration failing or timing out in upgrade from 2.7.4 to 3.0.0~rc1

Added by Peter Amstutz 3 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Deployment
Story points:
-
Release:
Release relationship:
Auto

Description

  • Duplicate the database (or the whole instance) from one of the clusters running 2.7.4 and run "apt install arvados-api-server" to see if there are any issues with the migrations.

All this needs is the rails API server and postgres (and anything else that is needed for the migration to work).


Subtasks 1 (0 open1 closed)

Task #22194: Review 22158-slow-migration-timeoutResolvedTom Clegg10/10/2024Actions
Actions #1

Updated by Peter Amstutz 3 months ago

  • Status changed from New to In Progress
Actions #2

Updated by Peter Amstutz 3 months ago

  • Subject changed from Reported possible issue with postinst migration not happening in upgrade from 2.7.4 to 3.0.0~rc1 to Reported possible issue with postinst migration failing or timing out in upgrade from 2.7.4 to 3.0.0~rc1
Actions #3

Updated by Peter Amstutz 3 months ago

  • Target version changed from Development 2024-10-09 sprint to Development 2024-10-23 sprint
Actions #4

Updated by Peter Amstutz 2 months ago

  • Release set to 70
Actions #5

Updated by Peter Amstutz 2 months ago

  • Assigned To set to Lucas Di Pentima
  • Description updated (diff)
Actions #6

Updated by Lucas Di Pentima 2 months ago

Cloned both pirca-database and pirca-api instances on a separate VPC, configured everything so that this new instance talks to the new PG server, and the PG server accepts connections from it.

Ran the upgrade to 3.0.0~rc1-1 and got the following timeout error while running migrations:

...
== 20240402162733 AddOutputGlobIndexToContainers: migrating ===================
-- execute("DROP INDEX IF EXISTS container_requests_full_text_search_idx")
   -> 0.0006s
-- execute("CREATE INDEX container_requests_full_text_search_idx ON container_requests USING gin(to_tsvector('english', substr(coalesce(name,'') || ' ' || coalesce(description,'') || ' ' || coalesce(properties::text,'') || ' ' || coalesce(state,'') || ' ' || coalesce(runtime_constraints::text,'') || ' ' || coalesce(environment::text,'') || ' ' || coalesce(cwd,'') || ' ' || coalesce(command::text,'') || ' ' || coalesce(output_path,'') || ' ' || coalesce(filters,'') || ' ' || coalesce(scheduling_parameters::text,'') || ' ' || coalesce(output_name,'') || ' ' || coalesce(output_properties::text,''), 0, 8000)))")
   -> 40.9261s
== 20240402162733 AddOutputGlobIndexToContainers: migrated (88.9449s) =========

== 20240604183200 ExcludeUuidsAndHashesFromTextSearch: migrating ==============
-- execute("DROP INDEX IF EXISTS collections_trgm_text_search_idx")
   -> 0.0038s
-- execute("CREATE INDEX collections_trgm_text_search_idx ON collections USING gin(((coalesce(name,'') || ' ' || coalesce(description,'') || ' ' || coalesce(properties::text,'') || ' ' || coalesce(file_names,''))) gin_trgm_ops)")
rake aborted!
StandardError: An error has occurred, this and all later migrations canceled:

PG::QueryCanceled: ERROR:  canceling statement due to statement timeout
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8.4/lib/active_record/connection_adapters/postgresql/database_statements.rb:48:in `exec'
...
Actions #7

Updated by Lucas Di Pentima 2 months ago

22158-slow-migration-timeout @ 65d3c56

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • Yes
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • No
  • Code is tested and passing, both automated and manual, what manual testing was done is described.
    • Manually tested on pirca cloned instances
  • New or changed UX/UX and has gotten feedback from stakeholders.
    • N/A
  • Documentation has been updated.
    • N/A
  • Behaves appropriately at the intended scale (describe intended scale).
    • N/A
  • Considered backwards and forwards compatibility issues between client and server.
    • N/A
  • Follows our coding standards and GUI style guidelines.
    • N/A

On my tests, only the 20240604183200_exclude_uuids_and_hashes_from_text_search.rb migration was timing out, but upon inspection I've found a similar one that also might be slow enough to fail: 20240820202230_exclude_container_image_from_text_search.rb, so this change adds code to both migration so that the statement timeout is disabled while running them.
This follows the same pattern as other slow migrations we did in the past.

Actions #8

Updated by Tom Clegg 2 months ago

LGTM, thanks.

As a follow-up perhaps we can find a way to force ourselves to explicitly mark every future migration as either "expected to be fast" or "disable/extend timeout". Maybe a test case that demands every db/migrate/*.rb file after date X must contain either "default timeout ok" or "set statement_timeout"?

Actions #9

Updated by Lucas Di Pentima 2 months ago

Tom Clegg wrote in #note-8:

LGTM, thanks.

As a follow-up perhaps we can find a way to force ourselves to explicitly mark every future migration as either "expected to be fast" or "disable/extend timeout". Maybe a test case that demands every db/migrate/*.rb file after date X must contain either "default timeout ok" or "set statement_timeout"?

Thanks.

Re: marking migrations as expected to be fast or not, we already have a "behaves appropriately at the intended scale" checklist item, so unless we do something like automatically disable the statement timeout on every future migration, we can fall into the same problem again.

Actions #10

Updated by Lucas Di Pentima 2 months ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF