4

Im making a (small) site in php & mysql. The mysql database consists of a single table with possibly tens of thousands of rows (in the future).

I only need to get results from one column which has a character limit of 200.

Considering that I don't want any stopwords or "common" word features; and the column contains filenames (some with special characters), what type of search would you recommend?

EDIT: Just some more clarification on filenames:
I don't want to do an exact match; say I have a name like this: [abc]_random_file.txt; searching for "abc random" OR "[abc] random" OR "abc txt" should all match. Hope that makes sense.

I've considered:

1----SQL like statement
2----SQL regexps (as shown on this website: http://www.iamcal.com/publish/articles/php/search/ )
3----Mysql Full-Text Search(MyISAM)
4----Third party search engines(really don't want to do this)

With #2 I can probably get the results I want, however I don't think it would work well with my table if it gets big?

Would appreciate any help; im a real beginner to all this and I've been googling all day :(

2
  • You don't say what kind of queries you want to do. If the column contains filenames then are you doing an exact match, i.e. select * from table where filename = '/tmp/flfl.txt'? Do you want to return partial name matches on the filenames? Commented Jun 29, 2011 at 11:25
  • Oh my bad. Nah I don't want to do an exact match, otherwise #1 would have worked for me lol. say I have a name like this: [abc] random file.txt; searching for "abc random" OR "[abc] random" OR "abc txt" should all match. Hope that makes more sense - will edit main post Commented Jun 29, 2011 at 11:28

3 Answers 3

1

What do you need to search? Optimizations can often be found when you take in account the limitations of your searches. For instance, if you only need to find all the rows that start with a specific text, you can add a simple index and use LIKE, which will give instant results.

In general, if a simple equality or LIKE will do, those will probably perform best. A regexp will anyway do a full scan. However even if there were a 100,000 rows, it would still take up mere 20MB in RAM, so a full scan through all that will not be very slow (unless you're on an ancient server).

I'd say - try the simple approach first and see what happens. If the performance reaches unacceptable levels, you should be easily able to change your approach. After all - you did state that the website was small.

Added: Just read your update. Sounds like a job for MySQL's fulltext index. Try it to see if it works for you.

Sign up to request clarification or add additional context in comments.

6 Comments

Thanks for the helpful reply, I had no idea that even 100,000 rows would take up only 20MB of ram. I'm sure my site will never get that many rows. Anyway, i'll probably mass add thousands of rows for testing, and see if regexp is viable. Regarding Fulltext index, do you know of a way to avoid the 50% threshold for common words?
@lelouch - Do the math yourself: 200 bytes times 100,000 rows equals 20MB. Well, ok, so there will be some overhead, plus other data, etc, etc. but in the end it's still well within the memory limits of today's computers; plus the amount of data your query needs to concern itself with is 20MB. I'm absolutely certain that MySQL will keep it cached in RAM at all times.
@lelouch - as for the 50% threshold - perhaps you need the boolean mode? Btw that works even without index, albeit slowly. You can also fine tune other aspects such as stopwords and minimum word length.
Oh wow, just had a look at the boolean mode, it seems to 90% do what I want, I can probably modify it to do the other 10%. I'll go with this, thanks! PS: Any reason i would want to NOT use an index for this boolean mode?
@lelouch - trying it out? Otherwise no, I can't think of a reason. Although... here's some food for thought - if the file is named myvacationplan.doc - would you want it to be found for the keyword vacation? If so, then I'm afraid no fulltext search will be right for you. :(
|
1
  1. SQL LIKE does not scale(Flickr also concluded)
  2. I am also affraid LIKE will fail on load.
  3. I think you should try to use this. I think this page might help you achieve your goal => http://dev.mysql.com/doc/refman/5.5/en/fulltext-boolean.html

'apple*'

Find rows that contain words such as “apple”, “apples”, “applesauce”, or “applet”.

  1. Even better would be 3rd-party.

I don't understand the reason for your rejection. I would probably use something third-party because no big site uses MySQL FULL TEXT for there search:


http://www.quora.com/Why-dont-any-of-the-popular-NoSQL-solutions-include-full-text-search

As someone who deploys and administers both full-text search and NoSQL servers for a good sized web site, here's my 2 cents:

Full-Text search is not an easy problem. It's fairly well understood, but it takes a fair amount of engineering effort and specialized knowledge to do really, really well and get all the details right. And nobody wants a half-assed search system built into their product--or one with weird limitations.

MySQL had full-text search that wasn't bad, but it was really only implemented in the MyISAM storage engine. But all the cool kids use InnoDB these days, so it's mostly useless or requires duplication of data. Yuck.

Personally, I'm a fan of Sphinx, which can be built into MySQL or used stand-alone, which is how we use it for Craigslist. We're also readying a MongoDB deployment that will be paired with Sphinx for full-text search. So you might think that I can't wait until 10gen implements full-text search into MongoDB, right?

Not really.

I subscribe to the Unix philosophy of having smaller tools that do fewer things but do them really well. What I've suggested to Dwight (and maybe Eliot too?) is that MongoDB should have a standard interface for plugging in various full-text search solutions. That is, it needs a way to notify external tools of new/changed/deleted documents so they can be handled appropriately. (Given MongoDB's oplog, this seems doable at some level.)

Now I know full well that Riak has an answer already. And I know that 10gen will build it into MongoDB anyway. As a provider of technology, they feel strongly that folks want a "complete" solution that works out of the box for a lot of people. But it (full-text) is never going to be the sort of thing that ships in version 1.0 of these products either. It's simply not a "core" feature.

It's also worth noting that from a performance and tuning point of view, search is different that document (or record) retrieval, which can complicate deployment matters too.

2 Comments

Hm. I'm actually on shared hosting right now (that's how small my site is :p); and for third party search I'd have to get a vps. Right? That's part of my reason for avoiding it.
@lelouch Aha :). But when you get big shared hosting providers aren't the way to go because you don't have any resources you can rely on. But in the beginning I would try to rely on MySQL boolean FULL-TEXT search if available.
0

Did you tried Sphinx? refer this :http://www.ibm.com/developerworks/library/os-php-sphinxsearch/

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.