Wednesday, May 21, 2008

Sphinx FAIL

On my recent Rails project I decided to try the Sphinx search engine. Before that I used Ferret, and then SOLR. I abandoned Ferret because of its instability and the lack of tools for tracking down problems (such as Luke, the index browser for Lucene).

So, Sphinx. I found two plugins for Rails: acts_as_sphinx and Ultrasphinx. I had heard on the mailing lists that Ultrasphinx is the better one (the Sphinx recipe in the Advanced Rails Recipes book also uses it), so I decided to go with it.

First I had to install Sphinx from source, because the version in MacPorts (although the latest released one) is too old: Ultrasphinx requires a newer version (at that point, the release candidate of the next revision, i.e. 0.9.8rc2 vs 0.9.7). Then I had to do various dances to get Sphinx to compile on Mac OS X (described in my previous post).

And then it began:

1. New data (e.g. new articles) had to be searchable with Sphinx right after it was added. And then I found out that running such updates often is discouraged; you are supposed to run a full reindex once a day... Wtf? I had heard about something called "deltas", but as far as I can tell from the plugin, it doesn't install any hooks on models, so I assume the deltas also have to be rebuilt periodically (which is just as unacceptable).
After consulting other people who have used Sphinx, I found out that:
1) they don't use the plugins and instead do all the communication through some low-level library (Riddle, as far as I remember) manually;
2) they install their own hooks on models and call the indexer manually to reindex them. I tried to install an after_save hook, but it runs BEFORE the transaction is committed, so the indexer can't see the inserted/updated data; I don't see how this can be accomplished easily.
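One workaround (my own sketch, not something any of these plugins provide) is to stop calling the indexer from the callback entirely: the after_save hook only records the ID of the changed row, and a separate cron-driven job later drains the accumulated IDs and runs the indexer, long after the transactions have committed. The `DeltaQueue` class below is hypothetical:

```ruby
require "set"

# Hypothetical sketch: instead of invoking the Sphinx indexer from
# after_save (where the transaction is not yet committed and the new
# rows are invisible), just remember which records changed. A cron job
# can later drain the queue and trigger a reindex, by which time the
# data is visible to the indexer.
class DeltaQueue
  def initialize
    @pending = Set.new
  end

  # Called from a model's after_save hook with the record's ID.
  def push(id)
    @pending << id
  end

  # Called by the periodic reindex job: returns the changed IDs
  # (deduplicated) and resets the queue.
  def drain
    ids = @pending.to_a.sort
    @pending.clear
    ids
  end
end
```

The hook itself then becomes trivial and cheap, and the latency of search updates is bounded by how often the cron job runs.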

2. The Sphinx configuration is one big scary thing. Ultrasphinx managed to build it for me, but I needed to make some tweaks to it, and the next time I had to generate the configuration for another model, my tweaks were lost.

3. The model on which I used Ultrasphinx (i.e. called ::is_indexed) failed to load via the automatic dependency loader. After several hours of tracking this problem down, I stuck require 'my_model' into environment.rb (which helped) and cursed Sphinx and all its plugins.

So, for me, Sphinx is a definite FAIL.

PS
I tried SOLR and it worked like a charm:
1) Almost no configuration.
The SOLR configuration consists of type definitions (which describe how a value of that type should be analyzed), field definitions, and dynamic field definitions (the acts_as_solr Rails plugin uses only dynamic fields). acts_as_solr ships with a default SOLR configuration that contains commonly used types and dynamic field definitions for them.
2) No compilation needed. acts_as_solr comes with the SOLR JAR files, so you just need a proper version of the Java runtime installed.
3) Works like a charm. Everything you would expect from a full-text search engine.
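For illustration, the dynamic-field mechanism described in point 1 looks roughly like this in a SOLR schema.xml (a minimal sketch; the type name and analyzer chain are assumptions, not the exact acts_as_solr defaults):

```xml
<!-- schema.xml fragment: a text type plus a dynamic field rule, so any
     field whose name ends in "_t" is analyzed as text automatically -->
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<dynamicField name="*_t" type="text" indexed="true" stored="false"/>
```

This is why the plugin needs no per-model schema changes: it just maps each indexed attribute onto a dynamic field with the matching suffix.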

5 comments:

Unknown comments...

Most if not all of the issues that you mentioned were related to third-party Ruby plugins, not to Sphinx itself.

For one, manually setting up main+delta indexing should normally take maybe 30-60 minutes. And that should keep the indexing lag in the 1-5 minute range.

Also, a trivial config for 1 data source and 1 index should be maybe 20 lines long. That isn't exactly "big and scary" to me...
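A single-source, single-index sphinx.conf is indeed about that size (a sketch with made-up credentials, table names, and paths; not a drop-in config):

```
source articles
{
    type      = mysql
    sql_host  = localhost
    sql_user  = app
    sql_pass  = secret
    sql_db    = app_production
    sql_query = SELECT id, title, body FROM articles
}

index articles
{
    source = articles
    path   = /var/data/sphinx/articles
}
```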

On the other hand, there are indeed still situations where Solr (or even MySQL's built-in full-text search) is handier to use. If the data size allows, why not. ;)

Yuri Volkov comments...

I've used that plugin to communicate with the searchd daemon. It just works. Of course I had to run the indexer manually to update the search indexes, but that is a trivial task - just use a cron job.
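Such a cron job could look like this (a sketch; the indexer path and schedule are assumptions):

```
# Rebuild all Sphinx indexes every 15 minutes and tell the running
# searchd to pick up the new files (--rotate)
*/15 * * * * /usr/local/bin/indexer --all --rotate >/dev/null 2>&1
```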

Anonymous comments...

I'm working on a project that accumulates thousands of records per day. We started out using Sphinx, but it became difficult to scale. After a few weeks our index grew to 9 GB. It took too long to reindex with the deltas, and search times were around 30 seconds.

It became too much of a hassle to come up with ways to partition the index. Using the recommended method - assigning a record to an index based on its ID modulo some number - would mean that we'd have to rebuild the index if we ever wanted to increase the number of indexes.
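The rebuild problem with ID-modulo partitioning is easy to see in a toy sketch (my own illustration, not Sphinx code): as soon as the partition count changes, most records map to a different partition, so nearly everything has to be reindexed.

```ruby
# Toy illustration: assign a record to one of N index partitions by
# ID modulo N. Growing N from 4 to 5 moves most records to a
# different partition, forcing a full rebuild.
def partition_for(id, partition_count)
  id % partition_count
end

ids = (1..1000).to_a
moved = ids.count { |id| partition_for(id, 4) != partition_for(id, 5) }
# `moved` covers most of the records: they all land in new partitions.
```

Only IDs whose residues happen to agree modulo both counts stay put (here, 200 of the 1000), which is why growing the partition count in place is impractical.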

We ended up moving to Solr. By default, records are split across several index files, and it's easy to configure the size of the index partitions. Additionally, with Sphinx we were limited to returning only a set of IDs or numeric values from the index. With Solr we can return string data and avoid hitting the database by keeping the needed data in the index.

If sphinx offered an automatic index partitioning scheme, I would think twice about it.

Unknown comments...

dkastner,

30-second queries against a 9 GB dataset are a clear indication that something was misconfigured on your side. The average response time against a 10 GB index should be well within 1 second. Actually, I'd expect it to be within ~0.1 second.

Perhaps you overpartitioned the data and created way too many indexes. That could easily kill search performance.

Thousands of incoming records per day is not much at all. Normally you'd just set up two indexes, with the delta carrying the current day's data, and reindex the main chunk nightly.
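That main+delta scheme can be sketched in sphinx.conf using source/index inheritance (a sketch assuming a main source and index named `articles`; the table and column names are made up):

```
# Inherit everything from the main source, but only pull today's rows
source articles_delta : articles
{
    sql_query = SELECT id, title, body FROM articles \
                WHERE created_at >= CURDATE()
}

index articles_delta : articles
{
    source = articles_delta
    path   = /var/data/sphinx/articles_delta
}
```

The small delta index can then be rebuilt every few minutes while the big main index is only rebuilt nightly.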

Or perhaps you were hitting PHP network I/O issues; 30 seconds looks suspiciously close to its default network timeout.

In any case, 30+ second response times are something... extremely unusual. One has to jump through special hoops to achieve that level of performance. ;)

I can witness 10-30 second queries myself - but those were a) against a 100+ GB collection, b) involving the most frequent words, retrieving tens of millions of documents, c) executed on a single-CPU, single-HDD machine, and d) under heavy concurrent load.

Unknown comments...

You should also check out Thinking Sphinx, a much nicer Rails plugin: http://ts.freelancing-gods.com/