Bug #22158
closedReported possible issue with postinst migration failing or timing out in upgrade from 2.7.4 to 3.0.0~rc1
Description
- Duplicate the database (or the whole instance) from one of the clusters running 2.7.4 and run "apt install arvados-api-server" to see if there are any issues with the migrations.
All this needs is the rails API server and postgres (and anything else that is needed for the migration to work).
Updated by Lucas Di Pentima 3 months ago
Cloned both pirca-database
and pirca-api
instances on a separate VPC, configured everything so that this new instance talks to the new PG server, and the PG server accepts connections from it.
Ran the upgrade to 3.0.0~rc1-1 and got the following timeout error while running migrations:
... == 20240402162733 AddOutputGlobIndexToContainers: migrating =================== -- execute("DROP INDEX IF EXISTS container_requests_full_text_search_idx") -> 0.0006s -- execute("CREATE INDEX container_requests_full_text_search_idx ON container_requests USING gin(to_tsvector('english', substr(coalesce(name,'') || ' ' || coalesce(description,'') || ' ' || coalesce(properties::text,'') || ' ' || coalesce(state,'') || ' ' || coalesce(runtime_constraints::text,'') || ' ' || coalesce(environment::text,'') || ' ' || coalesce(cwd,'') || ' ' || coalesce(command::text,'') || ' ' || coalesce(output_path,'') || ' ' || coalesce(filters,'') || ' ' || coalesce(scheduling_parameters::text,'') || ' ' || coalesce(output_name,'') || ' ' || coalesce(output_properties::text,''), 0, 8000)))") -> 40.9261s == 20240402162733 AddOutputGlobIndexToContainers: migrated (88.9449s) ========= == 20240604183200 ExcludeUuidsAndHashesFromTextSearch: migrating ============== -- execute("DROP INDEX IF EXISTS collections_trgm_text_search_idx") -> 0.0038s -- execute("CREATE INDEX collections_trgm_text_search_idx ON collections USING gin(((coalesce(name,'') || ' ' || coalesce(description,'') || ' ' || coalesce(properties::text,'') || ' ' || coalesce(file_names,''))) gin_trgm_ops)") rake aborted! StandardError: An error has occurred, this and all later migrations canceled: PG::QueryCanceled: ERROR: canceling statement due to statement timeout /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8.4/lib/active_record/connection_adapters/postgresql/database_statements.rb:48:in `exec' ...
Updated by Lucas Di Pentima 3 months ago
22158-slow-migration-timeout @ 65d3c56
- All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
- Yes
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- No
- Code is tested and passing, both automated and manual, what manual testing was done is described.
- Manually tested on pirca cloned instances
- New or changed UX/UX and has gotten feedback from stakeholders.
- N/A
- Documentation has been updated.
- N/A
- Behaves appropriately at the intended scale (describe intended scale).
- N/A
- Considered backwards and forwards compatibility issues between client and server.
- N/A
- Follows our coding standards and GUI style guidelines.
- N/A
On my tests, only the 20240604183200_exclude_uuids_and_hashes_from_text_search.rb
migration was timing out, but upon inspection I've found a similar one that also might be slow enough to fail: 20240820202230_exclude_container_image_from_text_search.rb
, so this change adds code to both migration so that the statement timeout is disabled while running them.
This follows the same pattern as other slow migrations we did in the past.
Updated by Tom Clegg 3 months ago
LGTM, thanks.
As a follow-up perhaps we can find a way to force ourselves to explicitly mark every future migration as either "expected to be fast" or "disable/extend timeout". Maybe a test case that demands every db/migrate/*.rb
file after date X must contain either "default timeout ok" or "set statement_timeout"?
Updated by Lucas Di Pentima 3 months ago
Tom Clegg wrote in #note-8:
LGTM, thanks.
As a follow-up perhaps we can find a way to force ourselves to explicitly mark every future migration as either "expected to be fast" or "disable/extend timeout". Maybe a test case that demands every
db/migrate/*.rb
file after date X must contain either "default timeout ok" or "set statement_timeout"?
Thanks.
Re: marking migrations as expected to be fast or not, we already have a "behaves appropriately at the intended scale" checklist item, so unless we do something like automatically disable the statement timeout on every future migration, we can fall into the same problem again.
Updated by Lucas Di Pentima 3 months ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|49c3e5aadc28751cc8e92d605d2d550319f83722.