
Hi,
On 2008-04-19 at 13:29:04 [+0200], Martin Edenhofer wrote:
today I was thinking about improvements for the fulltext search (especially on the article body), which I want to discuss here. :)
Current situation / Use case:
o OTRS 2.x
o Intel Core2 2.40GHz
A fulltext search on the body for "smith" takes about 15 seconds on large installations (done as a normal LIKE search).
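Roughly, the query looks like this (a sketch, not the literal OTRS statement; a_body being the article body column):

  SELECT id
    FROM article
   WHERE a_body LIKE '%smith%';
  -- the leading wildcard prevents any index use, so this is a
  -- full table scan over all article bodies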
Does anybody have ideas on how to improve this?
One idea would be to make use of any fulltext indexing capability that the DBMS provides. I have no idea whether all supported DBMS provide such a feature, but I do know that MySQL, Oracle and MSSQL have it.
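On MySQL that could look roughly like this (a sketch only; fulltext indices currently require MyISAM tables there, and the a_body column name is an assumption on my part):

  ALTER TABLE article ADD FULLTEXT INDEX ft_article_body (a_body);

  SELECT id
    FROM article
   WHERE MATCH (a_body) AGAINST ('smith');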
I do have 3 ideas.
a) Stripping/Unique Content
---------------------------
The idea is quite simple: create a mirror article table (e. g. article_search) and store the bodies from the article table in it, reduced to the unique words of each article. OTRS can then use this table 1:1 for fulltext searches.
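If I understand that correctly, it would look roughly like this (names and types invented for illustration):

  CREATE TABLE article_search (
      article_id INTEGER NOT NULL,
      a_body     TEXT             -- body reduced to its unique words
  );

  -- the search itself stays a LIKE, just over less data:
  SELECT article_id
    FROM article_search
   WHERE a_body LIKE '%smith%';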
This setup reduces the amount of data to scan, but it will not scale well with large datasets, since searching will still involve a full table scan of the mirror table. And while being a bit faster than searching the real article table, this strategy makes it impossible to search for word phrases like "%ticket search%", since storing only unique words destroys the word order (though I do not know whether OTRS currently supports phrase search at all).
b) Word-Table/Ref-Word-Article-Table
------------------------------------
The idea is to put all words used in the article bodies into a search_word (id, word) table and to store which words are used in which article body in a search_word_article (word_id, article_id) table. OTRS can then use these tables for fulltext searches.
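In SQL terms that would be something like this (column types are assumptions on my part; note the index on word, which is the whole point):

  CREATE TABLE search_word (
      id   INTEGER PRIMARY KEY,
      word VARCHAR(200) NOT NULL
  );
  CREATE INDEX idx_search_word ON search_word (word);

  CREATE TABLE search_word_article (
      word_id    INTEGER NOT NULL,
      article_id INTEGER NOT NULL
  );

  -- a single-word search becomes an indexed lookup plus a join:
  SELECT swa.article_id
    FROM search_word sw, search_word_article swa
   WHERE sw.word LIKE 'smith%'
     AND swa.word_id = sw.id;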
This should speed up searching through large datasets considerably, especially if the search phrases are anchored at the beginning ('content' or 'conte%', but not '%ntent'), since the search can then use the index and skip the full table scan altogether. AFAICS, the DBMS implementations of fulltext indices more or less use this approach and optimize it by maintaining a list of stop words (words considered irrelevant for search results) that are left out of the index and dropped from the given list of search keys. So if we added stop words to that strategy, I think it should improve the searching speed a lot.
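A stop word list could be kept as simply as this (MySQL-flavored sketch; the words are arbitrary examples):

  CREATE TABLE search_stop_word (
      word VARCHAR(200) PRIMARY KEY
  );
  INSERT INTO search_stop_word VALUES ('the'), ('and'), ('for'), ('with');

  -- at indexing time, skip every word that appears in this table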
c) Create Index of frequent search words
----------------------------------------
The idea is to put all frequently searched words into an extra index. Words that are not searched frequently are still looked up in real time (via the normal LIKE search).
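If I read that correctly, it amounts to something like this (names invented for illustration):

  CREATE TABLE search_frequent_word (
      word       VARCHAR(200) NOT NULL,
      article_id INTEGER NOT NULL
  );
  CREATE INDEX idx_search_frequent_word ON search_frequent_word (word);

  -- indexed words are answered from this table; any other word falls
  -- back to the real-time LIKE search on the article table:
  SELECT article_id
    FROM search_frequent_word
   WHERE word = 'smith';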
I suppose the quality and speed of this strategy depend a lot on the meaning of "frequently searched words". How did you determine these?
Results with about 12,000 tickets and 50,000 articles:
            | article      | a) article   | b) search_words | c) dedicated
            | (default)    | mirror uniq  | and ref table   | search_word
            |              | words        |                 | index
------------+--------------+--------------+-----------------+----------------
Storage Size| 60M          | 16M          | 23M             | 8M / 2000 words
------------+--------------+--------------+-----------------+----------------
Search Time | 4s           | 2s           | 1s              | Index 1s
            |              |              |                 | No Index 4s
------------+--------------+--------------+-----------------+----------------
Time to     | 0 min        | 30 min       | 2.5 h           | 1 h
create Index|              |              |                 |
------------+--------------+--------------+-----------------+----------------
Results with about 70,000 tickets and 240,000 articles:
            | article      | a) article   | b) search_words | c) dedicated
            | (default)    | mirror uniq  | and ref table   | search_words
            |              | words        |                 |
------------+--------------+--------------+-----------------+----------------
Storage Size| 440M         | 121M         | 394M            | 50M / 2000 words
------------+--------------+--------------+-----------------+----------------
Search Time | 17s          | 5.8s         | 48s             | Index 3.0s
            |              |              |                 | No Index 17s
------------+--------------+--------------+-----------------+----------------
Time to     | 0 min        | 1 h          | 9 h             | 4 h
create Index|              |              |                 |
------------+--------------+--------------+-----------------+----------------
So in my opinion b) would work fine for small setups but would not scale to larger installations.
Hm, this surprises me, since being able to use an index should really speed up the search on larger datasets, not slow it down as the numbers given above indicate.
c) looks good for large installations until a word that is not indexed is searched (e. g. searches for CVE or other numbers would always fall back to the non-indexed search). a) looks robust for both large and small installations.
From an algorithmic perspective, a) should be no considerable improvement over the original (current) searching strategy. Neither involves an index, so both will scale linearly with the size of the dataset - O(n) [and get much worse once the dataset no longer fits into memory]. Making use of an index as in b) and c) should improve that to O(log n).
Don't get me wrong, I'm not saying your numbers are wrong; I am merely struggling to figure out why these numbers are the way they are.
Any other ideas or questions?
Yes, I do have a couple of questions:
- What were the exact SQL statements you used to yield those numbers?
- In strategy b), there was an index on search_words.word, right? ;-)
- The two datasets you mentioned - did you create them artificially or are they copies of real setups (i.e. is the content "real" or computer generated)?

So far for now.

cheers,
Oliver

-- 
((otrs)) :: OTRS AG :: Europaring 4 :: D - 94315 Straubing
Fon: +49 (0) 9421 56818-0 :: Fax: +49 (0) 9421 56818-18
http://www.otrs.com/ :: Communication with success!