
Hi,
On 2008-04-19 at 13:29:04 [+0200], Martin Edenhofer wrote:
today I was thinking about improvements for the fulltext search (especially on the article body), which I want to discuss here. :)
Current situation / Use case:
o OTRS 2.x
o Intel Core2 2.40GHz
A fulltext search on the body for "smith" takes about 15 seconds on large installations (done as a normal LIKE search).
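Roughly, the query looks like this (a sketch, not the literal OTRS statement; a_body being the article body column):

  SELECT id
    FROM article
   WHERE a_body LIKE '%smith%';
  -- the leading wildcard prevents any index use, so this is a
  -- full table scan over all article bodies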
Does anybody have ideas on how to improve this?
One idea would be to make use of any fulltext indexing capability that the DBMS provides. I have no idea whether all supported DBMS provide such a feature, but I do know that MySQL, Oracle and MSSQL have it.
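On MySQL that could look roughly like this (a sketch only; fulltext indices currently require MyISAM tables there, and the a_body column name is an assumption on my part):

  ALTER TABLE article ADD FULLTEXT INDEX ft_article_body (a_body);

  SELECT id
    FROM article
   WHERE MATCH (a_body) AGAINST ('smith');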
I do have 3 ideas.
a) Stripping/Unique Content
---------------------------
The idea is quite simple: create a mirror article table (e. g. article_search) and store the bodies from the article table in it, reduced to the unique words of each article. OTRS can then use this table 1:1 for fulltext searches.
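If I understand that correctly, it would look roughly like this (names and types invented for illustration):

  CREATE TABLE article_search (
      article_id INTEGER NOT NULL,
      a_body     TEXT             -- body reduced to its unique words
  );

  -- the search itself stays a LIKE, just over less data:
  SELECT article_id
    FROM article_search
   WHERE a_body LIKE '%smith%';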
This setup reduces the amount of data to scan, but it will not scale well with large datasets, since searching will still involve a full table scan of the mirror table. And while being a bit faster than searching the real article table, this strategy makes it impossible to search for word phrases like "%ticket search%", since storing only unique words destroys the word order (though I do not know whether OTRS currently supports phrase search at all).
b) Word-Table/Ref-Word-Article-Table
------------------------------------
The idea is to put all words used in the article bodies into a search_word (id, word) table and to store which words are used in which article body in a search_word_article (word_id, article_id) table. OTRS can then use these tables for fulltext searches.
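In SQL terms that would be something like this (column types are assumptions on my part; note the index on word, which is the whole point):

  CREATE TABLE search_word (
      id   INTEGER PRIMARY KEY,
      word VARCHAR(200) NOT NULL
  );
  CREATE INDEX idx_search_word ON search_word (word);

  CREATE TABLE search_word_article (
      word_id    INTEGER NOT NULL,
      article_id INTEGER NOT NULL
  );

  -- a single-word search becomes an indexed lookup plus a join:
  SELECT swa.article_id
    FROM search_word sw, search_word_article swa
   WHERE sw.word LIKE 'smith%'
     AND swa.word_id = sw.id;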
This should speed up searching through large datasets considerably, especially if the search phrases are anchored at the beginning ('content' or 'conte%', but not '%ntent'), since the search can then use the index and skip the full table scan altogether. AFAICS, the DBMS implementations of fulltext indices more or less use this approach and optimize it by maintaining a list of stop words (words considered irrelevant for search results) that are left out of the index and dropped from the given list of search keys. So if we added stop words to that strategy, I think it should improve the searching speed a lot.
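A stop word list could be kept as simply as this (MySQL-flavored sketch; the words are arbitrary examples):

  CREATE TABLE search_stop_word (
      word VARCHAR(200) PRIMARY KEY
  );
  INSERT INTO search_stop_word VALUES ('the'), ('and'), ('for'), ('with');

  -- at indexing time, skip every word that appears in this table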
c) Create Index of frequent search words
----------------------------------------
The idea is to put all frequently searched words into an extra index. Words that are not searched frequently are still looked up in real time (via the normal LIKE search).
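If I read that correctly, it amounts to something like this (names invented for illustration):

  CREATE TABLE search_frequent_word (
      word       VARCHAR(200) NOT NULL,
      article_id INTEGER NOT NULL
  );
  CREATE INDEX idx_search_frequent_word ON search_frequent_word (word);

  -- indexed words are answered from this table; any other word falls
  -- back to the real-time LIKE search on the article table:
  SELECT article_id
    FROM search_frequent_word
   WHERE word = 'smith';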
I suppose the quality and speed of this strategy depend a lot on the meaning of "frequently searched words". How did you determine these?
Results with about 12,000 tickets and 50,000 articles:
            | article      | a) article   | b) search_words | c) dedicated
            | (default)    | mirror uniq  | and ref table   | search_word
            |              | words        |                 | index
------------+--------------+--------------+-----------------+----------------
Storage Size| 60M          | 16M          | 23M             | 8M / 2000 words
------------+--------------+--------------+-----------------+----------------
Search Time | 4s           | 2s           | 1s              | Index 1s
            |              |              |                 | No Index 4s
------------+--------------+--------------+-----------------+----------------
Time to     | 0 min        | 30 min       | 2.5 h           | 1 h
create Index|              |              |                 |
------------+--------------+--------------+-----------------+----------------
Results with about 70,000 tickets and 240,000 articles:
            | article      | a) article   | b) search_words | c) dedicated
            | (default)    | mirror uniq  | and ref table   | search_words
            |              | words        |                 |
------------+--------------+--------------+-----------------+----------------
Storage Size| 440M         | 121M         | 394M            | 50M / 2000 words
------------+--------------+--------------+-----------------+----------------
Search Time | 17s          | 5.8s         | 48s             | Index 3.0s
            |              |              |                 | No Index 17s
------------+--------------+--------------+-----------------+----------------
Time to     | 0 min        | 1 h          | 9 h             | 4 h
create Index|              |              |                 |
------------+--------------+--------------+-----------------+----------------
So in my opinion b) would work fine for small setups but would not scale to larger installations.
Hm, this surprises me, since being able to use an index should really speed up the search on larger datasets, not slow it down as the numbers given above indicate.
c) looks good for large installations until a word that is not indexed is searched (e. g. searches for CVE or other numbers would always fall back to the non-indexed search). a) looks robust for both large and small installations.
From an algorithmic perspective, a) should be no considerable improvement over the original (current) searching strategy. Neither involves an index, so both will scale linearly with the size of the dataset - O(n) [and get much worse once the dataset no longer fits into memory]. Making use of an index as in b) and c) should improve that to O(log n).
Don't get me wrong, I'm not saying your numbers are wrong; I am merely struggling to figure out why these numbers are the way they are.
Any other ideas or questions?
Yes, I do have a couple of questions:
- What were the exact SQL statements you used to yield those numbers?
- In strategy b), there was an index on search_words.word, right? ;-)
- The two datasets you mentioned - did you create them artificially or are they copies of real setups (i.e. is the content "real" or computer generated)?

So far for now.

cheers,
Oliver

-- 
((otrs)) :: OTRS AG :: Europaring 4 :: D - 94315 Straubing
Fon: +49 (0) 9421 56818-0 :: Fax: +49 (0) 9421 56818-18
http://www.otrs.com/ :: Communication with success!