Re: [dev] Ideas about Fulltext Searching (performance)?

19 Apr 2008

      Hi,

On Apr 19, 2008, at 15:51 , Oliver Tappe wrote:
...
One idea would be to make use of any fulltext indexing capability  
that the
DBMS provides. I have no idea if all supported DBMS provide such a  
feature
but I do know that mysql, oracle and mssql have it.
I tried this for MySQL and DB2. The problem on both is they only  
create indexes on words (so far as I figured out). So I was running  
into not finding ticket#... so I tried other ideas... but maybe I'm  
wrong.

Of course, they also would support other cool features like "sounds  
like"... :)
...
...
c) Create Index of frequent search words
----------------------------------------
This idea is to put all frequently searched words in an extra index.
And do not frequently search words in real time.
I suppose the quality and speed of this strategy depends a lot on  
the meaning
of "frequently searched words". How did you determine these?
At this point I was not determine this, I only took some words to find  
out what performance i get. ;)
...
...
So in my opinion for small setups b) would work very fine but would
not scale to lager installations.
Hm, this surprises me, since being able to use an index should  
really speed
up the search on larger datasets, not slow it down - as is indicated  
by the
numbers given above.
...
c) looks good for large installations till not indexed words are
searched (e. g. searches for CVE oder other numbers would always use
no indexed search). a) looks robust to both, large and small
installations.
...
From an algorithmic perspective, a) should be no considerable  
improvement
over the original (current) searching strategy. Both do not involve  
any
index, so they will correlate linearly to the size of the dataset -  
O(n) [and
get much worse if the dataset fails to fit into memory]. Making use  
of an
index as in c) and d) should improve that correlation to O(log n).
Don't get me wrong, I'm not saying you're numbers are wrong, but I  
am merely
struggling to find out why these numbers are the way they are.
...
Any other ideas or questions?
Yes, I do have a couple of questions:
- What were the exact SQL-statements you used to yield those numbers?
May be this is the main problem. A ticket search means I need to  
search over different tables and attributes. So in this case I need to  
involve the ticket, article, queue table. I also need to keep the  
permissions on ticket/queues in mind.

A query looks like:

SELECT DISTINCT st.id, st.tn, st.create_time_unix FROM ticket st,  
queue sq , article at WHERE sq.id = st.queue_id AND st.id =  
at.ticket_id AND sq.group_id IN (1, 2, 240, 3, 31, 312, 32)  AND  
( LOWER(at.a_body) LIKE LOWER('%smith%')) ORDER BY st.create_time_unix  
DESC

So the problem is, that I only need to search article where I have ro  
access.
...
- In strategy b), there was an index on search_words.word, right? ;-)
Yes. :)
...
- The two datasets you mentioned - did you create them artificially  
or are
 they copies of real setups (i.e. is the content "real" or computer
 generated)?
Yes. Data from live databases.

  -Martin

-- 
((otrs)) :: OTRS AG :: Europaring 4 :: D - 94315 Straubing
   Fon: +49 (0) 9421 56818 0 :: Fax: +49 (0) 9421 56818 18
     http://www.otrs.com/ :: Communication with success!

Address of record: Bad Homburg
Local Court: Bad Homburg, HRB 10751
Tax number: 003/240/97505

Chairman: Burchard Steinbild
Chief Executive Officer: André Mindermann

Re: [dev] Ideas about Fulltext Searching (performance)?

Martin Edenhofer