https://dev.arvados.org/https://dev.arvados.org/favicon.ico?15576888422018-12-03T19:13:46ZArvadosArvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=694242018-12-03T19:13:46ZTom Cleggtom@curii.com
<ul><li><strong>Tracker</strong> changed from <i>Bug</i> to <i>Feature</i></li></ul> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=694252018-12-03T19:44:51ZTom Cleggtom@curii.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/69425/diff?detail_id=66538">diff</a>)</li></ul> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=695672018-12-06T16:05:01ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Postgres text search has other problems when it comes to segmenting filenames. However I don't think that means we give up postgres, should create our own filename search table that has the behavior we want.</p>
<p><a class="external" href="https://dev.arvados.org/issues/13752#note-7">https://dev.arvados.org/issues/13752#note-7</a></p>
<p>Proposed solution:</p>
<p>Maintain our own filename index in a new table.</p>
<p>keyword → collection PDH</p>
<p>Perform custom filename tokenizing and support prefix search with "like" (can use B-tree indexes). Split on symbols like "_", "-" and ".", CamelCase (lower&arr;upper transitions). Convert everything to lowercase. For example:</p>
<p>"Sample_RMF1U7F_S27_R1_001.fastq.gz"</p>
<p>Would turn into:</p>
<p>"sample_rmf1u7f_s27_r1_001.fastq.gz" <br />"rmf1u7f_s27_r1_001.fastq.gz" <br />"s27_r1_001.fastq.gz" <br />"r1_001.fastq.gz" <br />"001.fastq.gz" <br />"fastq.gz" <br />"gz"</p>
<p>Searches would be prefix searches, eg a search on "RMF1U7F" would be <code>keyword like 'rmf1u7f%'</code></p> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=715752019-02-20T15:10:54ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Table of:</p>
<p>(filename, portable data hash of collection, path in collection, file size, content hash)</p> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=715982019-02-20T20:20:57ZTom Cleggtom@curii.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/71598/diff?detail_id=68641">diff</a>)</li></ul> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=715992019-02-20T20:30:18ZTom Cleggtom@curii.com
<ul><li><strong>Subject</strong> changed from <i>[API] Fully functional filename search</i> to <i>[Spike] [API] Fully functional filename search</i></li><li><strong>Description</strong> updated (<a title="View differences" href="/journals/71599/diff?detail_id=68643">diff</a>)</li><li><strong>Target version</strong> changed from <i>To Be Groomed</i> to <i>Arvados Future Sprints</i></li><li><strong>Story points</strong> set to <i>2.0</i></li></ul> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=716012019-02-20T20:30:32ZTom Cleggtom@curii.com
<ul><li><strong>Has duplicate</strong> <i><a class="issue tracker-6 status-7 priority-4 priority-default closed" href="/issues/13508">Idea #13508</a>: Fix postgres search for filenames</i> added</li></ul> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=716042019-02-20T20:31:04ZTom Cleggtom@curii.com
<ul><li><strong>Has duplicate</strong> <i><a class="issue tracker-6 status-7 priority-4 priority-default closed" href="/issues/14611">Idea #14611</a>: [Epic] Site-wide search for text, filenames, data</i> added</li></ul> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=723382019-03-13T15:45:07ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Target version</strong> changed from <i>Arvados Future Sprints</i> to <i>2019-03-27 Sprint</i></li></ul> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=723392019-03-13T15:45:32ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Assigned To</strong> set to <i>Lucas Di Pentima</i></li></ul> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=728282019-03-27T15:14:46ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Target version</strong> changed from <i>2019-03-27 Sprint</i> to <i>2019-04-10 Sprint</i></li></ul> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=728302019-03-27T15:16:44ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Assigned To</strong> changed from <i>Lucas Di Pentima</i> to <i>Peter Amstutz</i></li></ul> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=728952019-03-28T22:45:13ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Strategy 1:</p>
<p>Create two tables:</p>
<pre>
create table filenames (pdh text, filename text);
create table filetokens (filename text, token text);
</pre>
<p>The "filenames" table has a PDH and each file path contained in the manifest.</p>
<p>The "filetokens" table has a full file path and substring starting from each break point (such as punctuation).</p>
<p>Next, create indexes:</p>
<p><Pre><br />CREATE INDEX filenames_idx ON filenames (filename text_pattern_ops);</p>
<p>CREATE INDEX filetokens_idx ON filetokens (token text_pattern_ops);<br /></pre></p>
<p>Now we can search filenames by doing a prefix search on the filetokens table (this is efficient with the "text_pattern_ops" index) and joining with the "filenames" table to get the associated manifest PDHs.</p>
<pre>
explain analyze select pdh, filenames.filename from filetokens inner join filenames on filenames.filename=filetokens.filename where token like 'hu589D0B%'
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=1.12..146749.41 rows=103706 width=101) (actual time=0.254..118.289 rows=16541 loops=1)
-> Index Scan using filetokens_idx on filetokens (cost=0.56..8.58 rows=2290 width=68) (actual time=0.209..25.212 rows=5599 loops=1)
Index Cond: ((token ~>=~ 'hu589D0B'::text) AND (token ~<~ 'hu589D0C'::text))
Filter: (token ~~ 'hu589D0B%'::text)
-> Index Scan using filenames_idx on filenames (cost=0.56..63.85 rows=23 width=101) (actual time=0.014..0.016 rows=3 loops=5599)
Index Cond: (filename = filetokens.filename)
Planning time: 1.489 ms
Execution time: 119.139 ms
(8 rows)
</pre>
<p>It took 119ms but it turns out that's because there's a lot of files with 'hu589D0C':</p>
<pre>
arvados_development=> select count(*) from filetokens inner join filenames on filenames.filename=filetokens.filename where token like 'hu589D0B%'
;
count
-------
16541
</pre>
<p>We can isolate it to distinct manifest:</p>
<pre>
select count(distinct pdh) from filetokens inner join filenames on filenames.filename=filetokens.filename where token like 'hu589D0B%';
count
-------
36
</pre>
<p>Getting unique manifests is a bit quicker compared to getting the whole list of files (note there's no index on the pdh column, I don't know if that would help):</p>
<pre>
explain analyze select count(distinct pdh) from filetokens inner join filenames on filenames.filename=filetokens.filename where token like 'hu589D0B%';
Aggregate (cost=147008.67..147008.68 rows=1 width=8) (actual time=155.506..155.507 rows=1 loops=1)
-> Nested Loop (cost=1.12..146749.41 rows=103706 width=40) (actual time=0.027..145.026 rows=16541 loops=1)
-> Index Scan using filetokens_idx on filetokens (cost=0.56..8.58 rows=2290 width=68) (actual time=0.012..7.127 rows=5599 loops=1)
Index Cond: ((token ~>=~ 'hu589D0B'::text) AND (token ~<~ 'hu589D0C'::text))
Filter: (token ~~ 'hu589D0B%'::text)
-> Index Scan using filenames_idx on filenames (cost=0.56..63.85 rows=23 width=101) (actual time=0.021..0.023 rows=3 loops=5599)
Index Cond: (filename = filetokens.filename)
Planning time: 0.715 ms
Execution time: 155.543 ms
</pre>
<p>If we limit the number of rows returned, the query is much faster:</p>
<pre>
arvados_development=> explain analyze select pdh, filenames.filename from filetokens inner join filenames on filenames.filename=filetokens.filename where token like 'hu589D0B%' limit 50
;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1.12..71.87 rows=50 width=101) (actual time=0.039..0.250 rows=50 loops=1)
-> Nested Loop (cost=1.12..146749.41 rows=103706 width=101) (actual time=0.038..0.246 rows=50 loops=1)
-> Index Scan using filetokens_idx on filetokens (cost=0.56..8.58 rows=2290 width=68) (actual time=0.016..0.033 rows=13 loops=1)
Index Cond: ((token ~>=~ 'hu589D0B'::text) AND (token ~<~ 'hu589D0C'::text))
Filter: (token ~~ 'hu589D0B%'::text)
-> Index Scan using filenames_idx on filenames (cost=0.56..63.85 rows=23 width=101) (actual time=0.012..0.015 rows=4 loops=13)
Index Cond: (filename = filetokens.filename)
Planning time: 1.144 ms
Execution time: 0.280 ms
</pre>
<p>One drawback of this approach is that the "tokens" table gets pretty large, with about 60% of the collections on qr1hi loaded there's about 29 million tokens:</p>
<pre>
select count(*) from filetokens;
count
----------
29426500
</pre>
<p>Also, although our custom tokenizing logic is much better suited to filenames than full text search, it could still happen that someone wants to search for a substring that doesn't start at one of our chosen token break points.</p> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=728962019-03-28T22:45:23ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>See <a class="external" href="https://stackoverflow.com/questions/1566717/postgresql-like-query-performance-variations">https://stackoverflow.com/questions/1566717/postgresql-like-query-performance-variations</a></p> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=728972019-03-28T22:53:13ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Based on the SO post another thing I'm going to look at is an index using postgres trigram operators:</p>
<p><a class="external" href="https://www.postgresql.org/docs/current/pgtrgm.html">https://www.postgresql.org/docs/current/pgtrgm.html</a></p> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=728982019-03-28T23:04:04ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>I feel like there might be other tricks with operator indexes that could accomplish what we want as well, but that will require more research.</p> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=729112019-03-29T14:21:38ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>install postgresql-contrib</p>
<p>create extension pg_trgm;</p>
<p>CREATE INDEX filenames_trgm_idx ON public.filenames USING gin (filename public.gin_trgm_ops);</p>
<pre>
arvados_development=# explain analyze select * from filenames where filename like '%hu589D0B%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on filenames (cost=77.82..971.51 rows=235 width=101) (actual time=3.495..6.328 rows=5585 loops=1)
Recheck Cond: (filename ~~ '%hu589D0B%'::text)
Heap Blocks: exact=157
-> Bitmap Index Scan on filenames_trgm_idx (cost=0.00..77.76 rows=235 width=0) (actual time=3.443..3.443 rows=5585 loops=1)
Index Cond: (filename ~~ '%hu589D0B%'::text)
Planning time: 0.274 ms
Execution time: 6.731 ms
(7 rows)
</pre>
<p>Searching on filenames with the trigram index:</p>
<pre>
explain analyze select filename, pdh from filenames where filename ilike '%hu589D0B%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on filenames (cost=77.82..971.51 rows=235 width=101) (actual time=3.485..9.892 rows=5585 loops=1)
Recheck Cond: (filename ~~* '%hu589D0B%'::text)
Heap Blocks: exact=157
-> Bitmap Index Scan on filenames_trgm_idx (cost=0.00..77.76 rows=235 width=0) (actual time=3.460..3.460 rows=5585 loops=1)
Index Cond: (filename ~~* '%hu589D0B%'::text)
Planning time: 0.869 ms
Execution time: 10.121 ms
(7 rows)
</pre>
<p>This is faster than the token+join method.</p>
<pre>
explain analyze select filenames.filename, pdh from filetokens inner join filenames on filenames.filename=filetokens.filename where token like 'hu589D0B%';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=1.12..150551.94 rows=104061 width=101) (actual time=0.049..70.267 rows=16541 loops=1)
-> Index Scan using filetokens_idx on filetokens (cost=0.56..8.58 rows=2290 width=68) (actual time=0.020..4.339 rows=5599 loops=1)
Index Cond: ((token ~>=~ 'hu589D0B'::text) AND (token ~<~ 'hu589D0C'::text))
Filter: (token ~~ 'hu589D0B%'::text)
-> Index Scan using filenames_idx on filenames (cost=0.56..65.50 rows=24 width=101) (actual time=0.010..0.011 rows=3 loops=5599)
Index Cond: (filename = filetokens.filename)
Planning time: 1.462 ms
Execution time: 70.985 ms
(8 rows)
</pre>
<p>In fact it's worse than that, the token+join method returns duplicate results, so it needs select distinct:</p>
<pre>
select distinct(filenames.filename, pdh) from filetokens inner join filenames on filenames.filename=filetokens.filename where token like 'hu589D0B%' order by (filenames.filename, pdh);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=161715.90..162236.20 rows=104061 width=32) (actual time=242.187..248.489 rows=5585 loops=1)
-> Sort (cost=161715.90..161976.05 rows=104061 width=32) (actual time=242.187..243.733 rows=16541 loops=1)
Sort Key: (ROW(filenames.filename, filenames.pdh))
Sort Method: external merge Disk: 1976kB
-> Nested Loop (cost=1.12..150551.94 rows=104061 width=32) (actual time=0.083..68.332 rows=16541 loops=1)
-> Index Scan using filetokens_idx on filetokens (cost=0.56..8.58 rows=2290 width=68) (actual time=0.031..4.088 rows=5599 loops=1)
Index Cond: ((token ~>=~ 'hu589D0B'::text) AND (token ~<~ 'hu589D0C'::text))
Filter: (token ~~ 'hu589D0B%'::text)
-> Index Scan using filenames_idx on filenames (cost=0.56..65.50 rows=24 width=101) (actual time=0.009..0.010 rows=3 loops=5599)
Index Cond: (filename = filetokens.filename)
Planning time: 2.323 ms
Execution time: 249.045 ms
</pre>
<p>So the results lean strongly towards the trigram index</p> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=729152019-03-29T14:39:42ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>IT looks like the simplest solution for filename search, would not even require introducing any new tables, is:</p>
<ol>
<li>drop the fulltext index from collections.file_names</li>
<li>add a trigram GIN index on collections.file_names</li>
<li>update clients to use ilike '%blah%' to do filename searches <em>OR</em> adjust the behavior of our '@@' operator to be implemented by either fulltext or ilike searches depending on the record type.</li>
</ol> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=729182019-03-29T15:13:45ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>One benefit of a separate "files" table (as described in the original story description) is that the extra file count and file size columns being added in <a class="issue tracker-6 status-3 priority-4 priority-default closed parent" title="Idea: [API Server] Return collection size and number of files in collection record (Resolved)" href="https://dev.arvados.org/issues/14484">#14484</a> become unnecessary, they could be easily implemented using count(*) and sum(filesize) SQL aggregates.</p> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=730222019-04-03T14:37:22ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Whatever query workbench uses, API need to transform into query appropriate for filename search.</p> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=731182019-04-05T17:56:45ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>The e51c5 database has 4 million collections. The largest collection has a filename list just under 2 MB in size. With a seq scan, it takes 15 seconds to find a substring. With a trigram index, it takes 11ms.</p>
<pre>
arvados_development=> select count(*) from collections;
count
---------
4228696
arvados_development=> select max(length(file_names)) from collections;
max
---------
1962256
arvados_development=> select portable_data_hash from collections where file_names ilike '%AJ5LM4NX7CA%';
Time: 14399.272 ms
arvados_development=> CREATE INDEX collection_filenames_trgm_idx ON collections USING gin (file_names public.gin_trgm_ops);
CREATE INDEX
Time: 158394.288 ms
arvados_development=> select portable_data_hash from collections where file_names ilike '%AJ5LM4NX7CA%';
Time: 11.429 ms
arvados_development=> select portable_data_hash from collections where file_names ilike '%Sample_AJ5LM4NX7CA%';
Time: 29.370 ms
arvados_development=> select count(uuid) from collections where file_names ilike '%Sample_%';
count
--------
724071
(1 row)
Time: 9363.349 ms
arvados_development=> select portable_data_hash from collections where file_names ilike '%AJ5LM4NX7CA%' limit 1000;
Time: 4.203 ms
</pre>
<p>I notice the query time seems to be affected by uniqueness of the substring, so searching for "Sample_AJ5LM4NX7CA" (where "Sample_" is an incredibly common substring, appearing in 724071 rows) is slower than searching for just the sample id "AJ5LM4NX7CA". However with a limit on the number of rows to return, the query times are only in the double digits of milliseconds -- either the query is relatively unique and only returns a few rows (and so the index lookup is very quick), or there are many hits, and it hits the limit very quickly (using "offset" to page through the results is a bit slower, but only when there are 10s of thousands of results).</p> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=731192019-04-05T18:17:31ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul><p>Looks very promising! Thanks for the data. A couple of additional questions:</p>
<ul>
<li>What's the size of the index?</li>
<li>How does that compare to the size of the indexed text ( sum(length(file_names)) )?</li>
<li>Does the file_names column include everything or is it capped at some limit? (Already answered in chat - Everything)</li>
</ul>
<p>It looks to me like this is a good approach, but let's also get the rest of the team to review. We can either do that in the context of grooming an implementation ticket or just do it informally and roll this over into an implementation ticket.</p> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=731212019-04-05T18:35:13ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><pre>
arvados_development=> select pg_size_pretty(pg_total_relation_size('collections'));
pg_size_pretty
----------------
16 GB
arvados_development=> select pg_size_pretty(sum(length(manifest_text))) from collections;
pg_size_pretty
----------------
8294 MB
arvados_development=> select pg_size_pretty(sum(length(file_names))) from collections;
pg_size_pretty
----------------
743 MB
arvados_development=> select pg_size_pretty(pg_relation_size('collection_filenames_trgm_idx'));
pg_size_pretty
----------------
368 MB
</pre>
<p>So, the collections table is 16 GB, half of which is manifest_text, the file_names are about 10% the size of manifest_text, and trigram index is about half the size of file_names.</p> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=731252019-04-05T19:48:21ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><pre>
arvados_development=> select portable_data_hash from collections where manifest_text ilike '%AJ5LM4NX7CA%';
Time: 120136.877 ms
arvados_development=> CREATE INDEX collection_manifest_trgm_idx ON collections USING gin (manifest_text public.gin_trgm_ops);
CREATE INDEX
Time: 2100488.609 ms
arvados_development=> select portable_data_hash from collections where manifest_text ilike '%AJ5LM4NX7CA%';
Time: 41.068 ms
arvados_development=> select pg_size_pretty(pg_total_relation_size('collections'));
pg_size_pretty
----------------
19 GB
arvados_development=> select pg_size_pretty(pg_relation_size('collection_manifest_trgm_idx'));
pg_size_pretty
----------------
3191 MB
arvados_development=> select count(uuid) from collections where manifest_text ilike '%Sample_%';
count
--------
724071
(1 row)
Time: 85809.407 ms
</pre>
<p>As an experiment, I tried adding a trigram index on the manifest_text column. Building the index takes a very long time. The index itself is pretty large (seems roughly linear on the total amount of data in the column). Queries are a bit slower.</p>
<p>However it looks like having a file_names column + index adds roughly 1 GB compared to 3 GB for an index on manifest_text.</p>
<p>So for filename search, this is strictly worse. Although, it would add the capability to look up which manifests a block hash appears in.</p> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=731262019-04-05T20:17:44ZTom Cleggtom@curii.com
<ul></ul><p>Looks very promising.</p>
<p>It looks like this is available in the postgresql-contrib package in debian 9. Would be good to confirm availability & procedure needed to enable it (if any) on all supported platforms.</p>
<p>I expect we will want something like "search all columns for xyz", not just "search file_names for xyz". I'm guessing we can use trgm to index all searchable columns, including jsonb, so <code>["any","like","%foo%"]</code> would be fast, and could be used to find collection filenames, container mount filenames, etc...? If this works, could Workbench skip the full-text search entirely, and avoid the trouble of merging results from multiple search queries?</p>
<p>The current (FT) search implementation interprets "foo bar" as "foo & bar". Does this translate to <code>[["any","like","%foo bar%"]]</code> here, or <code>[["any","like","%foo%"],["any","like","%bar%"]]</code> -- and would those queries perform well too?</p> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=731612019-04-08T14:07:14ZTom Cleggtom@curii.com
<ul></ul><p>It would be good to confirm the search is functional (i.e., returns good results for user queries) in addition to being fast. I'm not sure of the best way to do this, or how rigorous we want to get, but it seems like we should check how well the results match for some obvious/typical queries on filenames as well as other kinds of fields that get searched.</p> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=731982019-04-10T15:04:35ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Target version</strong> changed from <i>2019-04-10 Sprint</i> to <i>2019-04-24 Sprint</i></li></ul> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=733302019-04-16T17:03:48ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><blockquote>
<p>It looks like this is available in the postgresql-contrib package in debian 9. Would be good to confirm availability & procedure needed to enable it (if any) on all supported platforms.</p>
</blockquote>
<p>Postgres trigram index has existed for a while, however support specifically for accelerating 'LIKE' queries may have only been introduced in 9.1 (that's when it appears in the documentation).</p>
<p>Debian has the postgresql-contrib going back to Jessie (Debian 8) with Postgres 9.4 <a class="external" href="https://packages.debian.org/search?keywords=postgresql-contrib&searchon=names&suite=all&section=all">https://packages.debian.org/search?keywords=postgresql-contrib&searchon=names&suite=all&section=all</a></p>
<p>Ubuntu has the postgresql-contrib package going back to trusty (14.04) with Postgres 9.3 <a class="external" href="https://packages.ubuntu.com/search?keywords=postgresql-contrib&searchon=all&suite=all&section=all">https://packages.ubuntu.com/search?keywords=postgresql-contrib&searchon=all&suite=all&section=all</a></p>
<p>Centos7 packages available at postgresql.org include postgresql-contrib <a class="external" href="https://www.postgresql.org/download/linux/redhat/">https://www.postgresql.org/download/linux/redhat/</a></p> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=733312019-04-16T17:59:09ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><pre>
select uuid from collections where file_names ilike '%foo%';
Time: 48.968 ms
select uuid from collections where file_names ilike '%foo%' and file_names ilike '%bar%';
Time: 53.635 ms
</pre>
<pre>
CREATE INDEX properties_trgm_idx ON public.collections USING gin (properties public.gin_trgm_ops);
ERROR: operator class "public.gin_trgm_ops" does not accept data type jsonb
CREATE INDEX properties_trgm_idx ON public.collections USING gin ((properties::text) public.gin_trgm_ops);
CREATE INDEX
Time: 66620.780 ms
arvados_development=> select properties from collections where properties::text ilike '%<a href="https://arvadosapi.com/e51c5-xvhdp-p48azvvjrbhmbf7">e51c5-xvhdp-p48azvvjrbhmbf7</a>%';
properties
------------------------------------------------------------------------
{"type": "log", "container_request": "<a href="https://arvadosapi.com/e51c5-xvhdp-p48azvvjrbhmbf7">e51c5-xvhdp-p48azvvjrbhmbf7</a>"}
{"type": "output", "container_request": "<a href="https://arvadosapi.com/e51c5-xvhdp-p48azvvjrbhmbf7">e51c5-xvhdp-p48azvvjrbhmbf7</a>"}
(2 rows)
Time: 67.937 ms
arvados_development=> select properties from collections where properties::text ilike '%foo%';
Time: 6.889 ms
</pre> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=733332019-04-16T18:48:45ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><blockquote>
<p>I expect we will want something like "search all columns for xyz", not just "search file_names for xyz". I'm guessing we can use trgm to index all searchable columns, including jsonb, so ["any","like","%foo%"] would be fast, and could be used to find collection filenames, container mount filenames, etc...? If this works, could Workbench skip the full-text search entirely, and avoid the trouble of merging results from multiple search queries?</p>
</blockquote>
<p>If we take the approach of changing the implementation of '<code>@', since queries are only allowed to be in the form [["any", "</code>@", "foo bar"]] then we can construct the search based on each column type and mix both 'ilike' and '@@' in the 'where' clause (so no merging results problem).</p>
<p>In addition to using ilike against individual columns, we can use ilike on a multi-column expression index (same approach used for full text):</p>
<pre>
CREATE INDEX merged_trgm_idx ON public.collections USING gin ((file_names || ' ' || properties::text) public.gin_trgm_ops);
CREATE INDEX
Time: 175172.242 ms
select file_names, properties from collections where (file_names || ' ' || properties::text) ilike '%176DEMH%';
Time: 7.256 ms
</pre>
<p>Using full text indexing for <em>some</em> columns would mean continuing to support language-based parsing/stemming/normalization/stop words/etc. The main situation I see where this would be useful would be in searching description fields that contain substantial prose and the user doesn't know exactly what she is looking for (ranking by full text score would be useful in that case, but we don't support that). With 'ilike' you would be able to find exact substring matches in descriptions, but not words that differ by changes in tense, pluralization, etc. For names/titles you might actually want <strong>both</strong> indexing strategies to maximize the chances of the search properly capturing user intent.</p>
<p>This feels like a product feature question. Do we continue to support full text search on certain columns (at the cost of slightly greater implementation complexity), or switch exclusively to substring search?</p>
<blockquote>
<p>The current (FT) search implementation interprets "foo bar" as "foo & bar". Does this translate to <a class="wiki-page new" href="https://dev.arvados.org/projects/arvados/wiki/%22any%22%22like%22%22%25foo_bar%25%22">"any","like","%foo bar%"</a> here, or <a class="wiki-page new" href="https://dev.arvados.org/projects/arvados/wiki/%22any%22%22like%22%22%25foo%25%22%5D%5B%22any%22%22like%22%22%25bar%25%22">"any","like","%foo%"],["any","like","%bar%"</a> -- and would those queries perform well too?</p>
</blockquote>
<p>The splitting and joining with '&' happens in Ruby. So instead of turning "foo bar" into "foo & bar" it would have turn it into <a class="wiki-page new" href="https://dev.arvados.org/projects/arvados/wiki/%22any%22%22like%22%22%25foo%25%22%5D%5B%22any%22%22like%22%22%25bar%25%22">"any","like","%foo%"],["any","like","%bar%"</a> to get the same results.</p> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=733342019-04-16T19:03:36ZTom Cleggtom@curii.com
<ul></ul><p>Peter Amstutz wrote:</p>
<blockquote>
<p>The splitting and joining with '&' happens in Ruby. So instead of turning "foo bar" into "foo & bar" it would have turn it into <a class="wiki-page new" href="https://dev.arvados.org/projects/arvados/wiki/%22any%22%22like%22%22%25foo%25%22%5D%5B%22any%22%22like%22%22%25bar%25%22">"any","like","%foo%"],["any","like","%bar%"</a> to get the same results.</p>
</blockquote>
<p>Currently we implement <code>filters=[["any","@@","foo bar"]]</code> by doing <code>where [...] @@ 'foo & bar'</code>.</p>
<p>The trigram docs have examples like <code>where t like '%foo%bar'</code> and <code>where t ~ '(foo|bar)'</code> ... so I thought maybe one of those could avoid expanding to <code>where ... like '%foo%' and ... like '%bar%'</code>. Either way I thought we might want to confirm performance & results are still promising when the query has more than one word.</p> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=733352019-04-16T19:20:30ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Tom Clegg wrote:</p>
<blockquote>
<p>Currently we implement <code>filters=[["any","@@","foo bar"]]</code> by doing <code>where [...] @@ 'foo & bar'</code>.</p>
</blockquote>
<p>Specifically:</p>
<pre>
# Use to_tsquery since plainto_tsquery does not support prefix
# search. And, split operand and join the words with ' & '
cond_out << model_class.full_text_tsvector+" @@ to_tsquery(?)"
param_out << operand.split.join(' & ')
</pre>
<p>My point is just that splitting and inserting ' & ' happens in the API server not the client, so we don't have to interpret '&' it is just an implementation detail (I think we're in violent agreement).</p>
<blockquote>
<p>The trigram docs have examples like <code>where t like '%foo%bar'</code> and <code>where t ~ '(foo|bar)'</code> ... so I thought maybe one of those could avoid expanding to <code>where ... like '%foo%' and ... like '%bar%'</code>. Either way I thought we might want to confirm performance & results are still promising when the query has more than one word.</p>
</blockquote>
<pre>
select uuid from collections where file_names ilike '%foo%bar%' or file_names ilike '%bar%foo%';
35.978 ms
select uuid from collections where file_names ilike '%foo%' and file_names ilike '%bar%';
Time: 56.192 ms
select uuid from collections where file_names ~ '(foo|bar)';
Time: 1208.119 ms
</pre>
<p>'~' is the regular expression operator.</p>
<p>Whoops, that last one is different, it is "contains foo OR bar" which is not the same.</p>
<p>However I don't think regular expressions have any advantage over ilike, we still end up having to do something like this:</p>
<pre>
select uuid from collections where file_names ~ 'foo.*bar' or file_names ~ 'bar.*foo';
Time: 34.146 ms
</pre> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=733362019-04-16T19:27:48ZTom Cleggtom@curii.com
<ul></ul><p>OK. I was just thinking that if a trigram index works by counting common trigrams, these queries might all return the same results.</p>
<pre><code>select uuid from collections where file_names ilike '%foo%bar%';<br />select uuid from collections where file_names ilike '%bar%foo%';<br />select uuid from collections where file_names ilike '%foo%bar%' or file_names ilike '%bar%foo%';</code></pre> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=733372019-04-16T19:38:06ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Tom Clegg wrote:</p>
<blockquote>
<p>OK. I was just thinking that if a trigram index works by counting common trigrams, these queries might all return the same results.</p>
<p>select uuid from collections where file_names ilike '%foo%bar%';<br />select uuid from collections where file_names ilike '%bar%foo%';<br />select uuid from collections where file_names ilike '%foo%bar%' or file_names ilike '%bar%foo%';</p>
</blockquote>
<p>No? If the 1st and 2nd statements returned the same results that would violate the semantics of 'ilike'.</p>
<p>There are "near match" operators for trigram but we haven't been talking about those.</p>
<p>The following statements return the same results, however the 1st one is slightly more efficient (at least in this case). But searching for all permutations of term ordering could get ugly when there's more than 3 or 4 terms.</p>
<pre>
select uuid from collections where file_names ilike '%foo%bar%' or file_names ilike '%bar%foo%';
35.978 ms
select uuid from collections where file_names ilike '%foo%' and file_names ilike '%bar%';
Time: 56.192 ms
</pre>
<p>I'm speculating but I believe the index works something like:</p>
<ol>
<li>break the search term into the set of trigrams</li>
<li>search for each trigram in the index to get a set of potential matches</li>
<li>perform the exact match test on each candidate match to confirm or reject it</li>
</ol> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=733562019-04-17T15:27:02ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-2 status-3 priority-4 priority-default closed parent" href="/issues/15106">Feature #15106</a>: [API] Index 'like' queries and use for search</i> added</li></ul> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=734742019-04-19T19:45:08ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li></ul> Arvados - Feature #14573: [Spike] [API] Fully functional filename searchhttps://dev.arvados.org/issues/14573?journal_id=746612019-05-21T22:27:37ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Release</strong> set to <i>15</i></li></ul>