This study is a comparative benchmark of Sphinx searchd performance, particularly looking at the effect of distributing an index across multiple CPU cores and/or multiple CPUs.
Absolute performance is highly dependent on the data and query set – the number of documents, the size of each document, the number of query terms, and the relative distribution of words and in documents and in the queries. To establish comparative performance, this study uses one set of data comprising 1.9M records in a total of 629MB, and a set of 10,000 multi-term queries relevant to the data set, averaging 3.1 words per query. For the test which compares performance on indexes of various size, the original data set is decimated by a factor to make smaller data sets with similar characteristics, or duplicated multiple times with different document identifiers to make larger data sets. In these cases the number of results and number of documents to be returned will be proportional to the data set size, but in all other respects the characteristics of the data set will remain constant.
The benchmark is based on Sphinx 1.11-dev (r2672) (February 2011), match mode ‘MATCH_ANY’, sort mode ‘RELEVANCE’, limited to 1000 results per query, returning the first 100. The set of 10,000 queries contained 6770 unique queries.
Comparison of worker models
Sphinx provides four multiprocessing (MPM) worker models – ‘fork’, ‘prefork’, ‘threads’ and ‘none’. In the ‘fork’ model, a new process is forked as required for each query as it arrives. In ‘prefork’, a set number of processes are forked when searchd starts, and each picks up a new query when it is free to do so and there are queries waiting, thereby avoiding the forking overhead. The ‘threads’ model is like ‘fork’, but using threads instead of forking, which is a smaller overhead. Under the ‘none’ model, a single process handles all queries sequentially.
In the following benchmarks, the query rate (requests per second) is measured over 10,000 standard queries with the index split into up to 8 parts and queried as a distributed index, for each of the MPM worker models, with query loads of 1, 2, 4 or 8 parallel clients. In the case of 1 client, all 10,000 queries are sent sequentially at maximum rate; in the case of 2 clients, the first client sends the first 5,000 queries and the second sends the second 5,000 queries in parallel at maximum rate, and so on. The distributed index is queried by setting the first index as a ‘local’ index so it is searched by the process that received the query, and the remaining indexes are set as ‘agents’ such that the first process sends the queries off to the agents, searches its local index, then collects the responses from the agents and merges the results.
The benchmarks have been performed on a server known as ‘titan3′, an 8-core (strictly, 4 cores with hyper-threading) server with 16GB RAM running Fedora 12.
Each of the graphs shows the trade-off between the larger number of indexes with faster, parallelised search vs the overhead of merging the results from the indexes, and the extra contention for available threads when the number of indexes and number of clients increases.
Data for the ‘none’ worker is provided just for reference; it is not a model that one would choose to use in normal circumstances as there is no gain to be made through parallelisation and longer queries block shorter queries since all are handled sequentially. The test with the ‘none’ worker had to be performed using the ‘local’ index model as the ‘agent’ model does not function correctly where only one process is available. As expected there is a gradual decline in performance as the number of indexes increases, but in fact the performance is relatively constant compared to the variation seen for the other models.
Excluding the ‘none’ worker model, performance is essentially similar with the exception of the prefork model under high load, where apparently the number of preforked processes is insufficient. Otherwise, the threaded worker is slightly faster than the fork worker, and the prefork worker sometimes has a small advantage over the fork worker. The threaded worker has a larger margin of performance over the fork worker with a larger number of indexes.
Thus the prefork worker has little to recommend it, being of negligible gain over fork but of potentially great loss. Whilst the threads worker has the best performance, it also carries the risk that if there is a bug in searchd that causes the process to crash, all services are lost, whereas with the fork worker only the crashed process is lost and searchd otherwise continues to receive queries and fork new processes.
Hence the threaded worker is the best choice for performance, but for the risk-averse the fork worker provides reliable services at only a small loss of performance.
Peak performance occurs at 4 indexes for 1-2 query clients and 2 indexes at 4-8 query clients. On a 4-core CPU (below), the peaks occur at 4 indexes for 1 client, 3 for 2 clients, 2 for 4 client and 1 for 8 clients, illustrating the trade-off between available cores and demand for parallel processes (no of clients times no of indexes). Best performance appears to be occurring when the demand is between 1-2 times the number of cores.
- Use threads worker (highest performance, medium risk) or fork worker (high performance, low risk) depending on risk
- Choose number of indexes such that (number of clients times number of indexes) is between 1-2 times the number of available cores
Use of dist_threads and local
Distinct from the threaded worker option described above, Sphinx 1.11-dev also offers a ‘dist_threads’ option which specifies the number of threads for handling local indexes. The previous benchmarks specified the first index as ‘local’ and the remaining indexes as ‘agent’ in the distributed index, so that the process receiving the query would search the first index and send requests to each of the other agent indexes, await the results and then merge them; where the agent index is on the same server, this would cause a network query back to the same searchd instance, which would then fork another process to handle each of the agents. With ‘dist_threads’, all of the local index searches are handled by local threads, thereby avoiding the network query and MPM worker overheads.
The above figure shows the variation in query rate with the number of threads allocated using the ‘dist_threads’ configuration option with all indexes specified as ‘local’. The comparison is based on 4 query clients.
The case of 1 thread is equivalent to searching each of the indexes sequentially. Notably, the performance is almost constant as the number of indexes increases, although there is a small decline in performance as would be expected due to the overhead of merging the results from each index.
Having more threads than indexes is no advantage; the results for 3 indexes at 4 and 3 threads, 2 indexes at 4, 3 and 2 threads, and 1 index for 4, 3, 2 and 1 thread are all coincident
A periodicity can be observed in the cases of 2, 3 and 4 threads, with a period of 2, 3 and 4 indexes respectively. When there are 2 threads to search 2 indexes, both searches finish in roughly equal time, or more generally if there are >= N threads to search N indexes, each thread handles no more than one index and the N used threads finish at about the same time. If there are 2 threads to search 3 indexes, then one thread must search 2 indexes so takes twice as long as the other, so that thread governs the time. As shown by the line where ‘threads=indexes’, optimal performance is achieved when all threads finish in the same time, which requires that the number of threads is equal to or at least a factor of the number of indexes.
Rather than having a situation where each thread would handle, for example, two indexes, in practice it makes sense to set the number of threads equal to the number of indexes and let the operating system manage the scheduling. It is hard to conceive of a practical situation where setting fewer threads would be an advantage.
- Use dist_threads = no of local indexes
- Never have a non-integer ratio of indexes to threads
‘local’ and ‘agent’
Per the discussion above, with all indexes specified as ‘local’, the search is performed sequentially so there is no performance gain as the number of indexes increases, compared to using ‘agent’ where the query can be parallelised by searchd querying itself. The above figure shows that when the number of indexes reaches 6, the network and forking overhead of the ‘agent’ approach is equal to the overhead of sequential search, so for more than 6 indexes the agent approach performs considerably worse than sequential search. Using dist_threads removes most of the overhead and performance is maintained as the number of indexes increases.
- Use ‘local’ and ‘dist_threads’ in preference to ‘agent’
Comparison of multi-core and fully distributed performance
Whilst using a multi-core CPU provides parallel CPU capability, disk, memory and peripheral I/O can still present bottlenecks. The following benchmark compares distributing indexes over several independent servers against a multi-core CPU.
The first 3 servers used here are identical 4-core machines – titan1, titan2 and titan4; the next server is titan3 which performs typically 70-90% better than the others; then salmoneus and atlas, which are marginally slower than titan3; then finally sisyphus, which is slowest of all. The relative benchmarks for each server based on 4 indexes, 4 clients, fork worker are as follows:
titan1/2/4: 75 req/s titan3: 133 req/s salmoneus: 106 req/s atlas: 112 req/s sisyphus: 21 req/s
Note that as well as providing extra I/O bandwidth, the addition of a CPU also adds to the pool of available CPU cores. For example, a test using 4 indexes and 4 clients has a demand for up to 16 threads, so performance on a single 4 core machine is expected to be worse than on two 4 core machines with 2 indexes each since the latter has a total of 8 cores available to service the 16 threads.
There is a rapid drop-off in performance in going from 6 to 7 indexes with 1 index/CPU, which may be explained by the introduction of the slowest server as the seventh index. The slowest server will govern overall performance. However, the drop-off from 8 to 10 indexes with 2 indexes/CPU is more difficult to explain as the extra server in this case (salmoneus) is quite capable; this test was checked and repeated several times with consistent results. The overhead of collecting and merging results appears to have had a step increase at over 8 indexes, which is also consistent with the results for 4 indexes/CPU.
The above figure shows that a distributed index across multiple servers has significantly better performance and scalability than across multiple cores. This leads to the question as to whether disk I/O is the bottleneck. However, as shown in the figure below, where the indexes were distributed across 4 disks in the one 8-core server, there is no material gain in performance by using multiple disks. It is likely that due to the large amount of RAM in this server available for caching, the disks are relatively infrequently accessed and the test is effectively CPU-bound, so the number of available cores is the dominant factor.
Comparison of performance for various sized indexes
The following data sets were created from the original 629MB set as follows:
- 40MB by decimation of original data set by 16
- 160MB by decimation of original data set by 4
- 1.3GB by duplicating original data set twice with new documents IDs
- 2.5GB by duplicating original data set four times with new documents IDs
- 10GB by duplicating original data set 16 times with new documents IDs
Each data set was split into 1 to 8 indexes and queried with 4 clients.
The graphs below show the same data in three different ways; the first shows the log-log relationship between request rate and data size, which is more clearly shown in the second graph as a roughly linear relationship between average query time and data size, regardless of the number of indexes. The third graph shows the variation in query rate with number of indexes for each data set, showing how the smaller data sets are more sensitive to the number of indexes since the overhead of communication is more significant relative to the query time in these cases.
Per earlier results, these results show that with 4 query clients, 2-4 indexes is the optimum number, independent of data size.
The linear relationship between query time and data size therefore suggests that for N indexes with no contention (disk/threads/memory etc), the search time can be reduced to , plus roughly merging overhead. From the earlier discussion on the availability of CPU cores to service the thread demand, it is likely that the apparent optimum of 2-4 indexes is driven by CPU core contention, and at higher numbers of indexes the merging overheads also contribute. So in the fully distributed case where core contention is less of an issue, performance may be increased by merging results hierarchically so the merging overhead reduces from to in the ideal case (where is the time to merge the results for each index).
The following figure compares the situation of 2 cores/CPU distributed across multiple servers with 4 query clients, in the one case merging all results on the first server, and in the second case merging the results hierarchically to 2 levels (by specifying a distributed index on each server that queries the two local indexes). Clearly the performance is significantly improved at 10 indexes and above.
Comparison of Query Modes
Given that performance appears to be significantly governed by the number of results returned, it would be expected that a MATCH_ALL query mode that causes many fewer results to be generated would be faster than the MATCH_ANY query mode. This is demonstrated in the following graph:
Comparison of Search Results
When the indexes are distributed, a limited set of results are retrieved from each index, partly ranked and then centrally merged, compared to the single index situation where the ranking occurs with a full view of the data set. This can potentially lead to differences between results. To evaluate this, the set of results for the 8770 unique queries in the test set were saved to disk from a single index and from a 4-part distributed index, and then compared. The measures of comparison were:
- Total results match – the number of times that the same number of results were returned from both sources (most of the time 100 results would have been returned, except when few matches were found).
- Total found match – the number of times that the reported number of total documents found was the same from both sources.
- Top 10 match – the number of times that the top 10 documents contained the same set (not necessarily in the same order)
- Doc set match – the number of times that both sources returned the same set of results (not necessarily in the same order)
- Doc groups match – Each document is returned with a weight, which may be an integer, and for many searches the weight will be the same for many documents, and therefore sorting by weight within each group of equally-weighted documents is arbitrary. Also, if there are more than 100 results, the set of documents selected for inclusion in the top 100 from the equally-weighted documents in the group at the cut-off point is arbitrary. The docs group match ignores the group of lowest weighted documents in the set and measures the degree of match in the remaining sets.
- Doc mismatch – the number of documents in the first set of results that are not found in the second set.
- Max mismatch – the maximum number of documents that are mismatched between the two sources for any query
- Median mismatch – the medium number of documents that are mismatched across all queries (excluding those that have no mismatches)
- Mismatch distribution – the number of times there was 1, 2, 3, up to 10 mismatched documents found
The following are the results for single and four-part indexes, with match modes of ‘ANY’ and ‘ALL’.
Single Index vs Four Part Index (Sphinx 1.11-dev, MATCH_ANY)
Total results match : 6770 / 6770 (100.00%) Total found match : 6586 / 6770 (97.28%) Top 10 match : 6374 / 6770 (94.15%) Doc set match : 5563 / 6770 (82.17%) Doc groups match : 5389 / 5993 (89.92%) Doc mismatch : 4532 / 673551 (0.67%) Max mismatch : 90 (604.00%) Median mismatch : 1.5 (1.50%) Mismatch distribution : 1-288 2-97 3-37 4-20 5-12 6-7 7-9 8-10 9-7 10-3
Single Index vs Four Part Index (Sphinx 1.11-dev, MATCH_ALL)
Total results match : 6735 / 6770 (99.48%) Total found match : 6642 / 6770 (98.11%) Top 10 match : 6641 / 6770 (98.09%) Doc set match : 6666 / 6770 (98.46%) Doc groups match : 3872 / 3981 (97.26%) Doc mismatch : 1951 / 363309 (0.54%) Max mismatch : 95 (109.00%) Median mismatch : 4 (4.00%) Mismatch distribution : 1-15 2-8 3-9 4-11 5-1 6-2 7-6 8-1 9-1 10-3
Notably there are some cases where the different approaches return a different number of results.
In considering the upgrade from Sphinx 0.9.9 to 1.1, it is also of interest to compare the difference between the two versions:
 Sphinx 0.9.9 vs 1.11-dev (Single Index, MATCH_ANY)
Total results match : 6770 / 6770 (100.00%) Total found match : 6769 / 6770 (99.99%) Top 10 match : 6535 / 6770 (96.53%) Doc set match : 5907 / 6770 (87.25%) Doc groups match : 5760 / 5993 (96.11%) Doc mismatch : 6365 / 673551 (0.94%) Max mismatch : 99 (233.00%) Median mismatch : 1 (1.00%) Mismatch distribution : 1-92 2-19 3-7 4-3 5-1 6-2 7-0 8-1 9-2 10-2
Sphinx 0.9.9 vs 1.11-dev (Single Index, MATCH_ALL)
Total results match : 6770 / 6770 (100.00%) Total found match : 6769 / 6770 (99.99%) Top 10 match : 6706 / 6770 (99.05%) Doc set match : 6707 / 6770 (99.07%) Doc groups match : 3888 / 3981 (97.66%) Doc mismatch : 2436 / 363309 (0.67%) Max mismatch : 99 (93.00%) Median mismatch : 10 (10.00%) Mismatch distribution : 1-15 2-7 3-5 4-4 5-2 6-5 7-0 8-2 9-2 10-3
There is a big advantage in almost every situation to distributing the sphinx index into two parts on a multi-core or multi-CPU server or servers. For light loads and with plenty of spare parallel CPU capacity, splitting into more parts may lead to further advantage. Going above 8 parts could cause performance degradation. In general, using 2-4 indexes provides the best performance in most situations. These recommendations assume a minimum of a 4 core server.