Search is Cool!

Wow! Its been a long time… I apologize for the hidas, I’ve been swamped with graduating college, starting a new job and chasing an ever increasingly mobile baby. But, enough of that. Today I’d like to talk about Search.

Over the last few months, I’ve found myself doing more and more database programming. One thing I was challenged with was the development of a search engine for a very large project I’m involved in at work. Having read on the subject before, I new immediately that I wanted to have an indexed search engine that used a porter stemming algorithm to stem words down to their natural form and then compare the stemmed version of each word in the query against a table that mapped words to their record ids (more information on the porter stemmer algorithm).

Long story short, I managed to make a pretty effective search engine built entirely using T-SQL. Thankfully, the most time consuming part of the task was done for me by the contributors of a T-SQL version of the porter algorithm at tartarus.org.

After including the UDF’s (user-defined functions) provided via tartarus.org, I needed to index my content using the algorithm. This was fairly simple. I parsed each word in the record set I wanted to search and assigned each stemmed word an id and stored the stemmed word and its id in a table. Next I created a mapping table that contained the record id and the word id.

Of course there are several words you don’t want to consider in the search. I believe the official term for these words are “stop words.” I referred to them as “drop words” when building this little baby. I created a table that contained words I want to ignore, like “I,” “me,” “they,” etc. So, when indexing or searching any word appearing in my drop words table are excluded.

Now my index is in place. I need to search it. I created a stored procedure that parses a search phrase and compares each stemmed word to the word index and constructed a tally of hits each word received. Once all words were parsed, I count up the hits and return the results.

So far I have 5 successful implementations of this search engine. I’ve modified it to include weighting criteria in one of those implementations and I also want to implement a word position constraint to match exact phrases to records and give them greater weight than simple hits, but that is for the future. My intention is to port this to MySQL 5, but that will be a little work. Once ported, I’ll probably show off some of the code.

Peace out…