Ayase Quart aka AQ has search plugins, and I wrote one to search for images. Images are first auto-tagged, and stored in a sqlite database. Here’s what we have.
I’ve put in a few days worth of effort into getting faster image search results at https://ayasequart.org/fts. Previously, the whole server-side request-response process (which includes page rendering) was always taking 0.1 - 1.2s, with image search being nearly all of that. Now it’s always below 0.05s.
Search results by themselves have been reduced to ~0.005s. That’s 100x faster!
The secret sauce relied on many ingredients. A better schema, proper indicies, but also a better query structure. I was originally doing,
-- GROUP-BY
select image.num
from image
join image_tag using (image_id)
where image.board = 'g'
and image.general >= 0.1
and image_tag.tag_id in (9173,4)
and image_tag.prob >= 0.4
group by image.image_id
having count(*) = 2
limit 250;
-- QUERY PLAN
-- |--SEARCH image_tag USING COVERING INDEX idx_image_tag_tag_prob_image (tag_id=? AND prob>?)
-- |--SEARCH image USING INTEGER PRIMARY KEY (rowid=?)
-- |--USE TEMP B-TREE FOR GROUP BY
USE TEMP B-TREE FOR GROUP BY
led to relatively bad
performance.
I searched StackOverflow and came across a few query
structures for this problem. The best structure I tested used N
joins, given N tag ids. This was 6) Sean
’s solution in the
linked SA answer.
-- MULTI-JOIN
select image.num
from image
join image_tag t0 using (image_id)
join image_tag t1 using (image_id)
where image.board = 'g'
and image.general >= 0.1
and t0.tag_id = 9173 and t0.prob >= 0.4
and t1.tag_id = 4 and t1.prob >= 0.4
limit 250;
-- QUERY PLAN
-- |--SEARCH t0 USING COVERING INDEX idx_image_tag_tag_prob_image (tag_id=? AND prob>?)
-- |--SEARCH image USING INTEGER PRIMARY KEY (rowid=?)
-- |--SEARCH t1 USING INDEX sqlite_autoindex_image_tag_1 (image_id=? AND tag_id=?)
Extra performance was also obtained by joining less common tags
first. As you’ll see, tag frequency has a large impact on performance.
There is a large difference between tags that appear in 0, 1, 10k and
100k images because of the
idx_image_tag_tag_prob_image (tag_id, prob)
index. I’m not
sure if there is way to make this faster.
The second best structure was 4) Derek
,
-- IMAGE-IN
select image.num
from image
where board = 'g'
and general >= 0.1
and image_id in (select image_id from image_tag where tag_id = 9173 and prob >= 0.4)
and image_id in (select image_id from image_tag where tag_id = 4 and prob >= 0.4)
limit 250;
-- QUERY PLAN
-- |--SEARCH image USING INTEGER PRIMARY KEY (rowid=?)
-- |--LIST SUBQUERY 1
-- | --SEARCH image_tag USING COVERING INDEX idx_image_tag_tag_prob_image (tag_id=? AND prob>?)
-- |--LIST SUBQUERY 2
-- | --SEARCH image_tag USING COVERING INDEX idx_image_tag_tag_prob_image (tag_id=? AND prob>?)
The worst structure, 5) Erwin 2
, used exists statments
in the where clause,
-- EXISTS
select image.num
from image
where image.board = 'g'
and image.general >= 0.1
and image.num is not null
and exists (
select 1
from image_tag it
where it.image_id = image.image_id
and it.tag_id = 4
and it.prob >= 0.4
)and exists (
select 1
from image_tag it
where it.image_id = image.image_id
and it.tag_id = 9173
and it.prob >= 0.4
)limit 250;
-- QUERY PLAN
-- |--SEARCH image USING INDEX idx_image_board_general_image_id (board=? AND general>?)
-- |--CORRELATED SCALAR SUBQUERY 1
-- | --SEARCH it USING INDEX sqlite_autoindex_image_tag_1 (image_id=? AND tag_id=?)
-- |--CORRELATED SCALAR SUBQUERY 2
-- | --SEARCH it USING INDEX sqlite_autoindex_image_tag_1 (image_id=? AND tag_id=?)
Here are the results for the queries above. 1girl
, tag
id 4, is the most common tag in my database, appearing in 184k images,
meanwhile tohru_(maidragon)
, tag id 9173, only appears in
540 images.
group_by sum_of_all_min_times=0.1303 wins=0
join sum_of_all_min_times=0.0011 wins=1
exists sum_of_all_min_times=0.1747 wins=0
image_in sum_of_all_min_times=0.0516 wins=0
Here are the other tag combinations. ‘high’, ‘mid’, and ‘low’ crudely refer to a tags frequency in the db. The rightmost column is the number of results obtained from the query.
per_page=250
tag_prob=0.4
rating_general=0.1
board='g'
1high_1mid group_by tag_count=2 min=0.1294 avg=0.1327 max=0.1414 225
1high_1mid join tag_count=2 min=0.001 avg=0.0011 max=0.0012 225
1high_1mid exists tag_count=2 min=0.1725 avg=0.1742 max=0.1754 225
1high_1mid image_in tag_count=2 min=0.0506 avg=0.0516 max=0.053 225
3high group_by tag_count=3 min=0.3625 avg=0.3656 max=0.3688 98
3high join tag_count=3 min=0.0011 avg=0.0012 max=0.0014 98
3high exists tag_count=3 min=0.174 avg=0.1755 max=0.1777 98
3high image_in tag_count=3 min=0.1372 avg=0.1376 max=0.1385 98
1mid group_by tag_count=3 min=0.3585 avg=0.3602 max=0.3625 98
1mid join tag_count=3 min=0.0012 avg=0.0012 max=0.0013 98
1mid exists tag_count=3 min=0.1751 avg=0.1769 max=0.1799 98
1mid image_in tag_count=3 min=0.1376 avg=0.1391 max=0.1428 98
1low group_by tag_count=1 min=0.0 avg=0.0 max=0.0001 3
1low join tag_count=1 min=0.0 avg=0.0 max=0.0 3
1low exists tag_count=1 min=0.1742 avg=0.1772 max=0.1819 3
1low image_in tag_count=1 min=0.0 avg=0.0 max=0.0001 3
2low group_by tag_count=2 min=0.0001 avg=0.0002 max=0.0006 0
2low join tag_count=2 min=0.0001 avg=0.0001 max=0.0002 0
2low exists tag_count=2 min=0.1707 avg=0.1724 max=0.1733 0
2low image_in tag_count=2 min=0.0001 avg=0.0002 max=0.0004 0
1high_1low group_by tag_count=2 min=0.2339 avg=0.2344 max=0.2357 24
1high_1low join tag_count=2 min=0.0 avg=0.0001 max=0.0003 24
1high_1low exists tag_count=2 min=0.1707 avg=0.1741 max=0.1765 24
1high_1low image_in tag_count=2 min=0.0825 avg=0.0832 max=0.0839 24
2high_1mid group_by tag_count=3 min=0.3619 avg=0.3636 max=0.366 250
2high_1mid join tag_count=3 min=0.0028 avg=0.003 max=0.0033 250
2high_1mid exists tag_count=3 min=0.1289 avg=0.1293 max=0.1299 250
2high_1mid image_in tag_count=3 min=0.1362 avg=0.1373 max=0.138 250
1high_1mid_1low group_by tag_count=3 min=0.1297 avg=0.1309 max=0.1318 0
1high_1mid_1low join tag_count=3 min=0.0 avg=0.0001 max=0.0002 0
1high_1mid_1low exists tag_count=3 min=0.172 avg=0.1733 max=0.1745 0
1high_1mid_1low image_in tag_count=3 min=0.0004 avg=0.0004 max=0.0006 0
2high group_by tag_count=2 min=0.3421 avg=0.3465 max=0.3498 250
2high join tag_count=2 min=0.0022 avg=0.0023 max=0.0024 250
2high exists tag_count=2 min=0.003 avg=0.0032 max=0.0036 250
2high image_in tag_count=2 min=0.1259 avg=0.1283 max=0.1294 250
4high group_by tag_count=4 min=0.5133 avg=0.5143 max=0.5154 250
4high join tag_count=4 min=0.0287 avg=0.0294 max=0.0304 250
4high exists tag_count=4 min=0.021 avg=0.0216 max=0.0225 250
4high image_in tag_count=4 min=0.1912 avg=0.1916 max=0.1926 250
6high group_by tag_count=6 min=0.7655 avg=0.7737 max=0.7797 250
6high join tag_count=6 min=0.1519 avg=0.1574 max=0.1629 250
6high exists tag_count=6 min=0.2394 avg=0.2402 max=0.2418 250
6high image_in tag_count=6 min=0.2833 avg=0.2861 max=0.2903 250
14high group_by tag_count=14 min=1.2962 avg=1.3005 max=1.305 3
14high join tag_count=14 min=0.1844 avg=0.1855 max=0.1878 3
14high exists tag_count=14 min=0.264 avg=0.265 max=0.2659 3
14high image_in tag_count=14 min=0.4708 avg=0.473 max=0.4741 3
The multi-join query structure is clearly the winner across a variety of tag searches being made.
group_by sum_of_all_min_times=4.4931 wins=2
join sum_of_all_min_times=0.3734 wins=9
exists sum_of_all_min_times=1.8655 wins=1
image_in sum_of_all_min_times=1.6158 wins=0
Note that sqlite has a limit of 64 joins.