Thinking Sphinx on Rails projects

I have been working with Thinking Sphinx for the past year and have made some observations. When I first started using it, I thought it was a pretty neat tool. The concept is straightforward: you define a set of relationships through a Sphinx index, and Sphinx builds an optimized query around them and caches it for reuse, so searches run very fast. This all works great when you're talking about simple relationships, which basically means a has_one or a has_many without a :through option inside the model. The problems with Sphinx start when you get into more complex datasets.
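To make the simple case concrete, here is a minimal sketch of an index definition in the Thinking Sphinx v3 style, assuming a hypothetical Article model with a title, a body, and a belongs_to :author association (the model and column names are illustrative, not from the original post):

```ruby
# app/indices/article_index.rb
ThinkingSphinx::Index.define :article, with: :active_record do
  # Full-text fields: text columns Sphinx will match searches against.
  indexes title
  indexes body

  # Text pulled through a simple association -- Sphinx generates the join.
  indexes author.name, as: :author_name

  # Attributes: numeric values available for filtering and sorting.
  has author_id, created_at
end
```

With a flat structure like this, Sphinx generates and optimizes the underlying query itself, which is where the tool shines.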

For example, say you are creating a complex scope. You have already written that scope with ActiveRecord, you know it works, and you want Sphinx to use the same logic. The issue is that Sphinx makes you define the conditions again in its own index format, so you cannot reuse the scope. It's not a huge deal, but it duplicates code just for indexing purposes. If a future developer comes along and updates only the ActiveRecord scope, the Sphinx index will be out of sync, and that might not even be noticeable until a user runs a specific search. This can lead to bugs very easily.
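A sketch of that duplication, again assuming a hypothetical Article model (the scope, column, and condition are made up for illustration):

```ruby
class Article < ActiveRecord::Base
  # The ActiveRecord scope the application already uses and trusts.
  scope :published, -> { where(published: true) }
end

# app/indices/article_index.rb
ThinkingSphinx::Index.define :article, with: :active_record do
  indexes title

  # Sphinx cannot call the scope above; the same condition has to be
  # restated as raw SQL here, and the two copies can silently drift apart.
  where "articles.published = 1"
end
```

If someone later changes the scope to, say, also require a publish date in the past, the SQL string here keeps indexing the old set of records until somebody notices the mismatch.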

Thinking Sphinx also has a concept of indexes versus has. An index is a text field that can be searched. A has relationship is an attribute meant for numeric values, so they can be sorted and filtered as numbers instead of text. If you want one of these has attributes to also be searchable, you need to declare it as both an index and a has relationship. It's another form of duplication; the tool should be smart enough to know that anything declared either way could be both sortable and searchable.
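A sketch of that fields-versus-attributes split, assuming a hypothetical Product model with a numeric price column (names are illustrative):

```ruby
# app/indices/product_index.rb
ThinkingSphinx::Index.define :product, with: :active_record do
  indexes name

  # `has` makes price a numeric attribute: sortable and filterable,
  # but not matched by full-text queries.
  has price

  # To also match price in a text search, it has to be declared a
  # second time as a field, under an alias to avoid a name clash.
  indexes price, as: :price_text
end
```

The same column ends up declared twice just so it can be both sorted on and searched for.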

The last point is that once you have a lot of data, indexing is a pain. With a small amount of data, indexing takes only around a minute, but as you keep adding data the process gets slower and slower, eventually taking a good half an hour to finish. That's not a huge deal in a production environment where you can run indexing overnight. The problem is when you don't have a quiet period: if your site is getting hits at all times of the day, even during the night, you cannot slow it down for indexing, so you are trapped in a way. Indexing is also a pain for developers. If they pull down a backup of the large dataset, they have to wait through that full indexing time before they can continue development, at least on any feature that needs the index. The rest of the site will work; just not the parts that use the Sphinx index.
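The overnight approach usually boils down to a cron entry around the standard Thinking Sphinx rake task; a sketch, assuming an app deployed at a made-up path of /var/www/myapp:

```shell
# crontab entry: rebuild the Sphinx index nightly at 3:00 AM,
# when traffic is (hopefully) at its lowest.
0 3 * * * cd /var/www/myapp && RAILS_ENV=production bundle exec rake ts:index >> log/sphinx_index.log 2>&1
```

This works only as long as that half-hour window actually exists; a site with round-the-clock traffic has no good slot to put it in.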

When you look at the big picture, Sphinx could be a great tool depending on the dataset you have. If you have a complex dataset that requires a lot of joins, Sphinx probably isn't the answer; it might be more pain to set up than it's worth. If you have a simple table structure, Sphinx might be a great choice. It all depends on how much data you have and how complex your data structure is. I would recommend exploring it, but keep in mind that the more complex the structure, the more time setup takes, and the more data, the longer indexing takes. You have to weigh all of that when deciding whether to use this tool.
