The following is a summary of some real-world data collected from the Sphinx query logs on a cluster of 15 servers. Each server runs its own copy of Sphinx, Apache, a busy web application, MySQL and miscellaneous services.
The dataset contains 453 million query log instances from 180 Sphinx indexes, collected over several months, using Sphinx version 0.9.8 on Linux kernel 2.6.18. The servers are all Dell PowerEdge 1950 with Quad Core Intel® Xeon® E5335, 2×4MB Cache, 2.0GHz, 1333MHz FSB, SATA drives, 7200rpm.
Keep in mind, though, that this is real-world data and not a controlled test. It shows how Sphinx performed in our environment, for the particular way we use Sphinx.
The graph below displays the response time distribution for all servers and all indexes. It shows, for example, that 60% of queries complete within 0.01 secs, 80% within 0.1 secs and 99% within 0.5 secs. Response times tend to fall into three bands (corresponding to the peaks in the frequency graph): <0.001 secs, 0.03 secs and 0.3 secs, which partly relates to the number of disk accesses required to fulfil a request. At 0.001 secs, all data is in memory, while at 0.3 secs, several disk accesses are occurring. The middle peak is not so obvious in this graph, but the per-server and per-index graphs, although they often have different distributions, still tend to have peaks at one or more of these three bands.
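For readers who want to produce the same kind of cumulative figures from their own logs, here is a minimal Python sketch. It assumes the response times have already been extracted from the query log into a list of seconds; the function name `fraction_within` and the sample values are hypothetical and are not taken from our dataset.

```python
from bisect import bisect_right


def fraction_within(times, thresholds):
    """Return {threshold: share of queries that completed within it}.

    `times` is a list of response times in seconds (assumed to be
    already extracted from the query log; parsing is not shown here).
    """
    if not times:
        raise ValueError("no response times supplied")
    ordered = sorted(times)
    total = len(ordered)
    # bisect_right counts how many sorted values are <= the threshold.
    return {t: bisect_right(ordered, t) / total for t in thresholds}


# Made-up sample values, for illustration only.
sample_times = [0.0005, 0.0008, 0.002, 0.009, 0.03, 0.04, 0.09, 0.3, 0.35, 0.6]
shares = fraction_within(sample_times, [0.01, 0.1, 0.5])

for threshold, share in shares.items():
    print(f"{share:.0%} of queries completed within {threshold} secs")
```

Sorting once and using a binary search per threshold keeps this cheap even when there are many thresholds, although for hundreds of millions of log lines a streaming histogram would probably be a better fit than an in-memory list.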
The next observation is that query word count affects performance, but not necessarily in proportion to the number of query words, as shown in the graph below. Queries of 1-4 words consistently offer the best performance. The 6-50 word range is consistently the slowest, most likely because the chance of finding documents with multiple matches is high, so there is extra ranking effort involved. Above 50 words, there is presumably a higher chance of having words with few matches, which speeds up the ranking process.
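A breakdown like this can be approximated by grouping queries into word-count bands and comparing a summary statistic per band. The sketch below is only illustrative: it assumes each record is a `(query_text, seconds)` pair, it counts words by splitting on whitespace (which will not exactly match how Sphinx tokenizes a query), and the band labels, including a separate "5" band, are our own hypothetical choice.

```python
from collections import defaultdict
from statistics import median


def word_count_band(query):
    """Map a query string to a word-count band (labels are illustrative)."""
    words = len(query.split())  # whitespace split: an approximation only
    if words <= 4:
        return "1-4"  # empty queries also land here in this sketch
    if words == 5:
        return "5"
    if words <= 50:
        return "6-50"
    return "51+"


def median_time_by_band(records):
    """Return {band: median response time in seconds} for (query, seconds) pairs."""
    grouped = defaultdict(list)
    for query, seconds in records:
        grouped[word_count_band(query)].append(seconds)
    return {band: median(values) for band, values in grouped.items()}


# Made-up records, for illustration only.
sample_records = [
    ("red shoes", 0.002),
    ("cheap red shoes", 0.004),
    ("shoes", 0.003),
    ("a b c d e f g", 0.2),
]
medians = median_time_by_band(sample_records)
print(medians)
```

The median is used here rather than the mean because response times are heavily skewed, so a handful of slow queries would otherwise dominate each band.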
Finally, we see that the size of the inverted index (.spd files) also affects performance. The three graphs below show how the response time distribution tends to move to the right as the index size increases. This is not unexpected: the larger the index, the higher the chance that data will need to be re-read from disk rather than served from Sphinx-internal or system buffers/cache.
Here is a PDF summary of Sphinx performance for this dataset, including many additional graphs of the data by server and by index.