
Hi, On Apr 19, 2008, at 15:51 , Oliver Tappe wrote:
One idea would be to make use of any fulltext indexing capability that the DBMS provides. I have no idea if all supported DBMS provide such a feature but I do know that mysql, oracle and mssql have it.
I tried this for MySQL and DB2. The problem on both is they only create indexes on words (so far as I figured out). So I was running into not finding ticket#... so I tried other ideas... but maybe I'm wrong. Of course, they also would support other cool features like "sounds like"... :)
c) Create Index of frequent search words ---------------------------------------- This idea is to put all frequently searched words in an extra index. And do not frequently search words in real time.
I suppose the quality and speed of this strategy depends a lot on the meaning of "frequently searched words". How did you determine these?
At this point I was not determine this, I only took some words to find out what performance i get. ;)
So in my opinion for small setups b) would work very fine but would not scale to lager installations.
Hm, this surprises me, since being able to use an index should really speed up the search on larger datasets, not slow it down - as is indicated by the numbers given above.
c) looks good for large installations till not indexed words are searched (e. g. searches for CVE oder other numbers would always use no indexed search). a) looks robust to both, large and small installations.
From an algorithmic perspective, a) should be no considerable improvement over the original (current) searching strategy. Both do not involve any index, so they will correlate linearly to the size of the dataset - O(n) [and get much worse if the dataset fails to fit into memory]. Making use of an index as in c) and d) should improve that correlation to O(log n).
Don't get me wrong, I'm not saying you're numbers are wrong, but I am merely struggling to find out why these numbers are the way they are.
Any other ideas or questions?
Yes, I do have a couple of questions:
- What were the exact SQL-statements you used to yield those numbers?
May be this is the main problem. A ticket search means I need to search over different tables and attributes. So in this case I need to involve the ticket, article, queue table. I also need to keep the permissions on ticket/queues in mind. A query looks like: SELECT DISTINCT st.id, st.tn, st.create_time_unix FROM ticket st, queue sq , article at WHERE sq.id = st.queue_id AND st.id = at.ticket_id AND sq.group_id IN (1, 2, 240, 3, 31, 312, 32) AND ( LOWER(at.a_body) LIKE LOWER('%smith%')) ORDER BY st.create_time_unix DESC So the problem is, that I only need to search article where I have ro access.
- In strategy b), there was an index on search_words.word, right? ;-)
Yes. :)
- The two datasets you mentioned - did you create them artificially or are they copies of real setups (i.e. is the content "real" or computer generated)?
Yes. Data from live databases. -Martin -- ((otrs)) :: OTRS AG :: Europaring 4 :: D - 94315 Straubing Fon: +49 (0) 9421 56818 0 :: Fax: +49 (0) 9421 56818 18 http://www.otrs.com/ :: Communication with success! Address of record: Bad Homburg Local Court: Bad Homburg, HRB 10751 Tax number: 003/240/97505 Chairman: Burchard Steinbild Chief Executive Officer: André Mindermann