Archive for Sphinx Search Engine

Sphinx Search Engine Comparative Benchmarks

Benchmark Overview

This study is a comparative benchmark of Sphinx searchd performance, particularly looking at the effect of distributing an index across multiple CPU cores and/or multiple CPUs.

Absolute performance is highly dependent on the data and query set – the number of documents, the size of each document, the number of query terms, and the relative distribution of words in the documents and in the queries. To establish comparative performance, this study uses one set of data comprising 1.9M records totalling 629MB, and a set of 10,000 multi-term queries relevant to the data set, averaging 3.1 words per query. For the test which compares performance on indexes of various sizes, the original data set is decimated by a factor to make smaller data sets with similar characteristics, or duplicated multiple times with different document identifiers to make larger data sets. In these cases the number of results and the number of documents to be returned will be proportional to the data set size, but in all other respects the characteristics of the data set remain constant.

The benchmark is based on Sphinx 1.11-dev (r2672) (February 2011), match mode ‘MATCH_ANY’, sort mode ‘RELEVANCE’, limited to 1000 results per query, returning the first 100. The set of 10,000 queries contained 6770 unique queries.

Comparison of worker models

Sphinx provides four multiprocessing (MPM) worker models – ‘fork’, ‘prefork’, ‘threads’ and ‘none’. In the ‘fork’ model, a new process is forked for each query as it arrives. In ‘prefork’, a set number of processes are forked when searchd starts, and each picks up a new query when it is free and there are queries waiting, thereby avoiding the forking overhead. The ‘threads’ model is like ‘fork’, but uses threads instead of forked processes, which incurs less overhead per query. Under the ‘none’ model, a single process handles all queries sequentially.
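For reference, the worker model is selected with the ‘workers’ directive in the searchd section of sphinx.conf. A minimal sketch (the listen port and file paths here are illustrative, not the ones used in the tests):

  searchd
  {
      listen   = 3312
      pid_file = /var/run/sphinx/searchd.pid
      log      = /var/log/sphinx/searchd.log
      # one of: none | fork | prefork | threads
      workers  = threads
  }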

In the following benchmarks, the query rate (requests per second) is measured over 10,000 standard queries with the index split into up to 8 parts and queried as a distributed index, for each of the MPM worker models, with query loads of 1, 2, 4 or 8 parallel clients. In the case of 1 client, all 10,000 queries are sent sequentially at maximum rate; in the case of 2 clients, the first client sends the first 5,000 queries and the second sends the second 5,000 queries in parallel at maximum rate, and so on. The distributed index is queried by setting the first index as a ‘local’ index so it is searched by the process that received the query, and the remaining indexes are set as ‘agents’ such that the first process sends the queries off to the agents, searches its local index, then collects the responses from the agents and merges the results.
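As a rough sketch of how such a distributed index might be declared in sphinx.conf, assuming the data has been split into four part indexes (the index names, host and port are illustrative):

  index dist_main
  {
      type  = distributed
      # the first part is searched by the process that receives the query
      local = main_part0
      # the remaining parts are queried over the network back to the same searchd
      agent = localhost:3312:main_part1
      agent = localhost:3312:main_part2
      agent = localhost:3312:main_part3
  }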

The benchmarks have been performed on a server known as ‘titan3’, an 8-core (strictly, 4 cores with hyper-threading) server with 16GB RAM running Fedora 12.

Fork Worker

Prefork Worker

Threads Worker

None Worker

Each of the graphs shows the trade-off between faster, parallelised search across a larger number of indexes and the overhead of merging the results from those indexes, together with the extra contention for available threads as the number of indexes and the number of clients increase.

Data for the ‘none’ worker is provided just for reference; it is not a model that one would choose to use in normal circumstances, as there is no gain to be made through parallelisation, and longer queries block shorter queries since all are handled sequentially. The test with the ‘none’ worker had to be performed using the ‘local’ index model, as the ‘agent’ model does not function correctly where only one process is available. As expected there is a gradual decline in performance as the number of indexes increases, but the performance is relatively constant compared to the variation seen for the other models.

Excluding the ‘none’ worker model, performance is broadly similar across the remaining models, with the exception of the prefork model under high load, where apparently the number of preforked processes is insufficient. Otherwise, the threaded worker is slightly faster than the fork worker, and the prefork worker sometimes has a small advantage over the fork worker. The threaded worker's margin over the fork worker increases with the number of indexes.

Thus the prefork worker has little to recommend it, being of negligible gain over fork but of potentially great loss. Whilst the threads worker has the best performance, it also carries the risk that if there is a bug in searchd that causes the process to crash, all services are lost, whereas with the fork worker only the crashed process is lost and searchd otherwise continues to receive queries and fork new processes.

Hence the threaded worker is the best choice for performance, but for the risk-averse the fork worker provides reliable services at only a small loss of performance.

Peak performance occurs at 4 indexes for 1-2 query clients and 2 indexes for 4-8 query clients. On a 4-core CPU (below), the peaks occur at 4 indexes for 1 client, 3 for 2 clients, 2 for 4 clients and 1 for 8 clients, illustrating the trade-off between available cores and the demand for parallel processes (number of clients times number of indexes). Best performance appears to occur when the demand is between one and two times the number of cores.

Fork Worker (4-core server)

Recommendations

  • Use the threads worker (highest performance, medium risk) or the fork worker (high performance, low risk) depending on risk tolerance
  • Choose the number of indexes such that (number of clients times number of indexes) is between one and two times the number of available cores

Use of dist_threads and local

Distinct from the threaded worker option described above, Sphinx 1.11-dev also offers a ‘dist_threads’ option which specifies the number of threads used to search local indexes. The previous benchmarks specified the first index as ‘local’ and the remaining indexes as ‘agent’ in the distributed index, so that the process receiving the query would search the first index, send requests to each of the other agent indexes, await the results and then merge them. Where an agent index is on the same server, this causes a network query back to the same searchd instance, which then forks another process to handle each agent. With ‘dist_threads’, all of the local index searches are handled by local threads, thereby avoiding the network query and MPM worker overheads.
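A sketch of the all-local arrangement (index names illustrative); dist_threads is set in the searchd section and requires workers = threads:

  searchd
  {
      workers      = threads
      # number of threads used to search the local parts of a distributed index
      dist_threads = 4
  }

  index dist_main
  {
      type  = distributed
      local = main_part0
      local = main_part1
      local = main_part2
      local = main_part3
  }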

Effect of dist_threads

The above figure shows the variation in query rate with the number of threads allocated using the ‘dist_threads’ configuration option with all indexes specified as ‘local’. The comparison is based on 4 query clients.

The case of 1 thread is equivalent to searching each of the indexes sequentially. Notably, the performance is almost constant as the number of indexes increases, although there is a small decline in performance as would be expected due to the overhead of merging the results from each index.

Having more threads than indexes offers no advantage; the results for 3 indexes at 4 and 3 threads, for 2 indexes at 4, 3 and 2 threads, and for 1 index at 4, 3, 2 and 1 threads are all coincident.

A periodicity can be observed in the cases of 2, 3 and 4 threads, with a period of 2, 3 and 4 indexes respectively. When there are 2 threads to search 2 indexes, both searches finish in roughly equal time; more generally, if there are at least N threads to search N indexes, each thread handles no more than one index and the N busy threads finish at about the same time. If there are 2 threads to search 3 indexes, then one thread must search 2 indexes and takes twice as long as the other, so that thread governs the overall time. As shown by the line where ‘threads = indexes’, optimal performance is achieved when all threads finish in the same time, which requires that the number of threads is equal to, or at least a factor of, the number of indexes.

Rather than having a situation where each thread would handle, for example, two indexes, in practice it makes sense to set the number of threads equal to the number of indexes and let the operating system manage the scheduling. It is hard to conceive of a practical situation where setting fewer threads would be an advantage.

Recommendations

  • Set dist_threads equal to the number of local indexes
  • If using fewer threads than indexes, keep the number of indexes an integer multiple of the number of threads

‘local’ and ‘agent’

Effect of 'local' option

Per the discussion above, with all indexes specified as ‘local’ and no dist_threads, the search is performed sequentially, so there is no performance gain as the number of indexes increases, in contrast to using ‘agent’, where the query can be parallelised by searchd querying itself. The above figure shows that when the number of indexes reaches 6, the network and forking overhead of the ‘agent’ approach equals the cost of sequential search, so beyond 6 indexes the agent approach performs considerably worse than sequential search. Using dist_threads removes most of the overhead, and performance is maintained as the number of indexes increases.

Recommendations

  • Use ‘local’ and ‘dist_threads’ in preference to ‘agent’

Comparison of multi-core and fully distributed performance

Whilst using a multi-core CPU provides parallel CPU capability, disk, memory and peripheral I/O can still present bottlenecks. The following benchmark compares distributing indexes over several independent servers against distributing them across the cores of a single multi-core server.

The first 3 servers used here are identical 4-core machines – titan1, titan2 and titan4; the next server is titan3, which typically performs 70-90% better than the others; then salmoneus and atlas, which are marginally slower than titan3; and finally sisyphus, which is the slowest of all. The relative benchmarks for each server, based on 4 indexes, 4 clients and the fork worker, are as follows:

titan1/2/4:   75 req/s
titan3:      133 req/s
salmoneus:   106 req/s
atlas:       112 req/s
sisyphus:     21 req/s

Note that as well as providing extra I/O bandwidth, the addition of a server also adds to the pool of available CPU cores. For example, a test using 4 indexes and 4 clients demands up to 16 threads, so performance on a single 4-core machine is expected to be worse than on two 4-core machines with 2 indexes each, since the latter has a total of 8 cores available to service the 16 threads.

Multi-CPU

There is a rapid drop-off in performance in going from 6 to 7 indexes with 1 index/CPU, which may be explained by the introduction of the slowest server as the seventh index. The slowest server will govern overall performance. However, the drop-off from 8 to 10 indexes with 2 indexes/CPU is more difficult to explain as the extra server in this case (salmoneus) is quite capable; this test was checked and repeated several times with consistent results. The overhead of collecting and merging results appears to have had a step increase at over 8 indexes, which is also consistent with the results for 4 indexes/CPU.

The above figure shows that a distributed index across multiple servers has significantly better performance and scalability than across multiple cores. This leads to the question as to whether disk I/O is the bottleneck. However, as shown in the figure below, where the indexes were distributed across 4 disks in the one 8-core server, there is no material gain in performance by using multiple disks. It is likely that due to the large amount of RAM in this server available for caching, the disks are relatively infrequently accessed and the test is effectively CPU-bound, so the number of available cores is the dominant factor.

Multi-disk Comparison
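The multi-disk arrangement was along these lines, with each part index's ‘path’ placed on a different physical disk (the mount points and index names here are illustrative):

  index main_part0
  {
      source = src_part0
      path   = /disk1/sphinx/main_part0
  }
  index main_part1
  {
      source = src_part1
      path   = /disk2/sphinx/main_part1
  }
  # ... and similarly for the remaining parts on /disk3 and /disk4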

Comparison of performance for various sized indexes

The following data sets were created from the original 629MB set:

  • 40MB by decimating the original data set by a factor of 16
  • 160MB by decimating the original data set by a factor of 4
  • 1.3GB by duplicating the original data set twice with new document IDs
  • 2.5GB by duplicating the original data set four times with new document IDs
  • 10GB by duplicating the original data set 16 times with new document IDs

Each data set was split into 1 to 8 indexes and queried with 4 clients.

The graphs below show the same data in three different ways; the first shows the log-log relationship between request rate and data size, which is more clearly shown in the second graph as a roughly linear relationship between average query time and data size, regardless of the number of indexes. The third graph shows the variation in query rate with number of indexes for each data set, showing how the smaller data sets are more sensitive to the number of indexes since the overhead of communication is more significant relative to the query time in these cases.

Consistent with the earlier results, these show that with 4 query clients, 2-4 indexes is the optimum number, independent of data size.

The linear relationship between query time and data size therefore suggests that for N indexes with no contention (disk/threads/memory etc.), the search time can be reduced to roughly 1/N of the single-index time, plus a merging overhead of roughly N × T_m, where T_m is the time to merge the results from one index. From the earlier discussion on the availability of CPU cores to service the thread demand, it is likely that the apparent optimum of 2-4 indexes is driven by CPU core contention, with the merging overhead also contributing at higher numbers of indexes. So in the fully distributed case, where core contention is less of an issue, performance may be increased by merging results hierarchically, reducing the merging overhead from N × T_m to T_m × log N in the ideal case.

The following figure compares the situation of 2 indexes per server distributed across multiple servers with 4 query clients, in the one case merging all results on the first server, and in the other merging the results hierarchically over 2 levels (by specifying a distributed index on each server that queries its two local indexes). Clearly the performance is significantly improved at 10 indexes and above.

Comparison of Distributed Index with Hierarchical Organisation
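A sketch of the two-level arrangement (host names, ports and index names are illustrative): each server defines a distributed index over its own two local parts, and the first server's top-level distributed index searches its own parts locally while querying the other servers' per-server distributed indexes as agents, so each remote server merges its own two result sets before returning them:

  # on each remote server, e.g. titan2
  index dist_node
  {
      type  = distributed
      local = part_a
      local = part_b
  }

  # on the first server, which receives the queries
  index dist_all
  {
      type  = distributed
      local = part_a
      local = part_b
      agent = titan2:3312:dist_node
      agent = titan4:3312:dist_node
  }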

Comparison of Query Modes

Given that performance appears to be significantly governed by the number of results returned, it would be expected that a MATCH_ALL query mode that causes many fewer results to be generated would be faster than the MATCH_ANY query mode. This is demonstrated in the following graph:

Comparison of Query Modes

Comparison of Search Results

When the indexes are distributed, a limited set of results is retrieved from each index, partly ranked and then centrally merged, whereas in the single-index case the ranking occurs with a full view of the data set. This can potentially lead to differences between results. To evaluate this, the sets of results for the 6770 unique queries in the test set were saved to disk from a single index and from a 4-part distributed index, and then compared. The measures of comparison were:

  • Total results match – the number of times that the same number of results were returned from both sources (most of the time 100 results would have been returned, except when few matches were found).
  • Total found match – the number of times that the reported number of total documents found was the same from both sources.
  • Top 10 match – the number of times that the top 10 documents contained the same set (not necessarily in the same order)
  • Doc set match – the number of times that both sources returned the same set of results (not necessarily in the same order)
  • Doc groups match – Each document is returned with a weight, which may be an integer, and for many searches the same weight applies to many documents, so the ordering by weight within each group of equally-weighted documents is arbitrary. Also, if there are more than 100 results, the choice of documents included in the top 100 from the equally-weighted group at the cut-off point is arbitrary. The doc groups match ignores the group of lowest-weighted documents in the set and measures the degree of match among the remaining groups.
  • Doc mismatch – the number of documents in the first set of results that are not found in the second set.
  • Max mismatch – the maximum number of documents that are mismatched between the two sources for any query
  • Median mismatch – the median number of documents that are mismatched across all queries (excluding those that have no mismatches)
  • Mismatch distribution – the number of times there was 1, 2, 3, up to 10 mismatched documents found

The following are the results for single and four-part indexes, with match modes of ‘ANY’ and ‘ALL’.

Single Index vs Four Part Index (Sphinx 1.11-dev, MATCH_ANY)

 Total results match           : 6770 / 6770 (100.00%)
 Total found match             : 6586 / 6770 (97.28%)
 Top 10 match                  : 6374 / 6770 (94.15%)
 Doc set match                 : 5563 / 6770 (82.17%)
 Doc groups match              : 5389 / 5993 (89.92%)
 Doc mismatch                  : 4532 / 673551 (0.67%)
 Max mismatch                  : 90 (604.00%)
 Median mismatch               : 1.5 (1.50%)
 Mismatch distribution         : 1-288 2-97 3-37 4-20 5-12 6-7 7-9 8-10 9-7 10-3

Single Index vs Four Part Index (Sphinx 1.11-dev, MATCH_ALL)

 Total results match           : 6735 / 6770 (99.48%)
 Total found match             : 6642 / 6770 (98.11%)
 Top 10 match                  : 6641 / 6770 (98.09%)
 Doc set match                 : 6666 / 6770 (98.46%)
 Doc groups match              : 3872 / 3981 (97.26%)
 Doc mismatch                  : 1951 / 363309 (0.54%)
 Max mismatch                  : 95 (109.00%)
 Median mismatch               : 4 (4.00%)
 Mismatch distribution         : 1-15 2-8 3-9 4-11 5-1 6-2 7-6 8-1 9-1 10-3

Notably there are some cases where the different approaches return a different number of results.

In considering the upgrade from Sphinx 0.9.9 to 1.11-dev, it is also of interest to compare the difference between the two versions:

Sphinx 0.9.9 vs 1.11-dev (Single Index, MATCH_ANY)

 Total results match           : 6770 / 6770 (100.00%)
 Total found match             : 6769 / 6770 (99.99%)
 Top 10 match                  : 6535 / 6770 (96.53%)
 Doc set match                 : 5907 / 6770 (87.25%)
 Doc groups match              : 5760 / 5993 (96.11%)
 Doc mismatch                  : 6365 / 673551 (0.94%)
 Max mismatch                  : 99 (233.00%)
 Median mismatch               : 1 (1.00%)
 Mismatch distribution         : 1-92 2-19 3-7 4-3 5-1 6-2 7-0 8-1 9-2 10-2

Sphinx 0.9.9 vs 1.11-dev (Single Index, MATCH_ALL)

 Total results match           : 6770 / 6770 (100.00%)
 Total found match             : 6769 / 6770 (99.99%)
 Top 10 match                  : 6706 / 6770 (99.05%)
 Doc set match                 : 6707 / 6770 (99.07%)
 Doc groups match              : 3888 / 3981 (97.66%)
 Doc mismatch                  : 2436 / 363309 (0.67%)
 Max mismatch                  : 99 (93.00%)
 Median mismatch               : 10 (10.00%)
 Mismatch distribution         : 1-15 2-7 3-5 4-4 5-2 6-5 7-0 8-2 9-2 10-3

Summary

There is a big advantage in almost every situation to distributing the Sphinx index into two parts on a multi-core or multi-CPU server or servers. For light loads and with plenty of spare parallel CPU capacity, splitting into more parts may give a further advantage, while going above 8 parts can cause performance degradation. In general, 2-4 indexes provide the best performance in most situations. These recommendations assume at least a 4-core server.


Sphinx Search Engine Performance

The following is a summary of some real-world data collected from the Sphinx query logs on a cluster of 15 servers. Each server runs its own copy of Sphinx, Apache, a busy web application, MySQL and miscellaneous services.

The dataset contains 453 million query log instances from 180 Sphinx indexes, collected over several months, using Sphinx version 0.9.8 on Linux kernel 2.6.18. The servers are all Dell PowerEdge 1950 with Quad Core Intel® Xeon® E5335, 2×4MB Cache, 2.0GHz, 1333MHz FSB, SATA drives, 7200rpm.

Keep in mind, though, that this is real world data and not a controlled test. This is how Sphinx performed in our environment, for the particular way we use Sphinx.

The graph below displays the response time distribution for all servers and all indexes, and shows, for example, that 60% of queries complete within 0.01 secs, 80% within 0.1 secs and 99% within 0.5 secs. Response times tend to fall into 3 bands (corresponding to the peaks in the frequency graph) – <0.001 secs, 0.03 secs and 0.3 secs – which partly relates to the number of disk accesses required to fulfil a request. At 0.001 secs, all data is in memory, while at 0.3 secs, several disk accesses are occurring. Whilst the middle peak is not so obvious in this graph, the per-server and per-index graphs often have different distributions but still tend to have peaks at one or more of these three bands.
Sphinx Query Response Times Total for all servers, all indexes

The next observation is that query word count affects performance, but not necessarily in proportion to the number of query words, as shown in the graph below. Queries of 1-4 words consistently offer the best performance. The 6-50 word range is consistently the slowest, most likely because the chance of finding documents with multiple matches is high, so there is extra ranking effort involved. Above 50 words, there is presumably a higher chance of including words with few matches, which speeds up the ranking process.
Sphinx Query Response Time by Query Word Count

Finally, we see that the size of the inverted index (.spd files) also affects performance. The three graphs below show how the response time distribution tends to move to the right as the index size increases. The larger the index, the higher the chance that data will need to be re-read from disk (rather than from Sphinx-internal or system buffers/cache), hence this is not unexpected.
Sphinx Query Response Times for Index Sizes 1MB - 3MB
Sphinx Query Response Times for Index Sizes 3MB - 30MB
Sphinx Query Response Times for Index Sizes >30MB

Here is a PDF summary of Sphinx performance for this dataset, including many additional graphs of the data by server and by index.


Integrating Sphinx into Perl Applications

Sphinx is a full-text search engine (http://www.sphinxsearch.com) designed
primarily for full-text search of database content.  It has many features but in
my opinion its best assets are speed of search and scalability.

We started using Sphinx when MySQL’s built-in full-text search was becoming too
slow, too CPU intensive, and of questionable accuracy.  Sphinx is lightning
fast compared to MySQL and provides better result relevancy.

This note is about integration with the standalone Sphinx search server. Sphinx
also has a component (‘SphinxSE’) that runs as a MySQL 5 storage engine, so it
can be used as a direct replacement for MySQL full-text search; to use SphinxSE,
standard Perl DBI should be all that is necessary.

What you will need:

The following CPAN modules are likely to be useful:

Sphinx::Search
Sphinx::Manager
Sphinx::Config

Sphinx::Manager provides facilities to start and stop the search server and to
run the indexer.

Sphinx::Search provides the search API.

Sphinx::Config allows you to read/write the Sphinx configuration files from
code, in case you wish to maintain the configuration elsewhere (e.g. in your
database).

Putting it all together:

Running the Sphinx searchd server

Sphinx operates most efficiently if it is allowed to run persistently as a
background service.  Theoretically, you could start the Sphinx server, do a
search and then stop it on every request, with a small amount of overhead – but
here we will consider just the typical case.

Ideally you will use operating system tools such as daemontools, monit or just
the SysV startup scripts to start and monitor searchd, rather than have to worry
about it in your perl app.  But, if you need or want to start it from perl:

  use Sphinx::Manager;
  my $mgr = Sphinx::Manager->new({ config_file => '/etc/sphinx.conf' });
  $mgr->start_searchd;

You should verify that the effective UID of your perl app has all of the appropriate
permissions:

  • to create and write to the PID file (see the ‘searchd’ section of the config, ‘pid_file’)
  • to create and write to the log file (see ‘searchd’/‘log’)
  • to read the Sphinx database files (‘path’ in each of your ‘index’ specifications)

Adding Content to the Index

  use Sphinx::Manager;
  my $mgr = Sphinx::Manager->new({ config_file => '/etc/sphinx.conf' });
  $mgr->run_indexer('--rotate');

Sphinx gets its content for indexing directly from the database, according to
the ‘sql_query’ given in the config file.  ‘run_indexer’ simply runs the command
line version of the Sphinx indexer program.  You can pass any indexer arguments
through to ‘run_indexer’; ‘--rotate’ is typical, to force searchd to start using
the newly created index without disrupting searches while indexing is
occurring.
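For context, a minimal sketch of the source and index definitions that the indexer reads (the database credentials, table and column names here are illustrative):

  source content
  {
      type      = mysql
      sql_host  = localhost
      sql_user  = user
      sql_pass  = secret
      sql_db    = mydb
      # first column must be a unique unsigned integer document ID
      sql_query = SELECT id, title, body FROM articles
  }

  index articles
  {
      source = content
      path   = /var/data/sphinx/articles
  }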

Searching

Make sure you have a version of Sphinx::Search that is compatible with searchd.
A compatibility list is given at the top of the Sphinx::Search perldoc.
Hopefully a point will be reached where the Sphinx::Search client can support a
range of searchd versions, but for the moment that is impractical.

Sphinx::Search can be used with any logging object that supports error, warn,
info and debug methods.  In this example I have used Log::Log4perl.

  use Sphinx::Search;
  use Log::Log4perl qw(:easy);
  Log::Log4perl->easy_init($DEBUG);
  my $sph = Sphinx::Search->new( log => Log::Log4perl->get_logger('sphinx.search') );
  my $results = $sph->SetMatchMode(SPH_MATCH_ALL)
                    ->Query("...");

Configuring

Sphinx::Config provides the tools to read and write the Sphinx configuration file.

A typical problem is that searchd is running on a non-standard port (the default
is 3312), so how will your perl app know where to find it?  Obviously you don’t
want to hard-code port numbers in case they change…

use Sphinx::Search;
use Sphinx::Config;
use Log::Log4perl qw(:easy);

Log::Log4perl->easy_init($DEBUG);

my $sph = Sphinx::Search->new( log => Log::Log4perl->get_logger('sphinx.search') );

# Get port from config file
my $conf = Sphinx::Config->new;
$conf->parse('/etc/sphinx.conf');
my $port = $conf->get('searchd', undef, 'port');

# Tell Sphinx client where to find searchd
$sph->SetServer('localhost', $port);

my $results = $sph->Query("...");


Enjoy

We have had a considerable amount of success using Perl and Sphinx.  I hope you
do too.



Sphinx::Search 0.08 released to CPAN

I have just uploaded to CPAN the latest version of Sphinx::Search, the Perl API for the Sphinx Search Engine.

Search for Sphinx::Search on CPAN to get the latest.

Version 0.08 is suitable for Sphinx 0.9.8-svn-r871 and later (currently r909). This version fixes a couple of bugs related to error checking.

I have been asked a few times what makes Sphinx::Search different from the Perl API that comes bundled in the contrib directory of the Sphinx distribution. The bundled Sphinx.pm was used as the starting point of Sphinx::Search. Maintenance of that version appears to have lapsed at sphinx-0.9.7, so many of the newer API calls are not available there. Sphinx::Search is mostly compatible with the old Sphinx.pm except:

  • On failure, Sphinx::Search returns undef rather than 0 or -1.
  • Sphinx::Search ‘Set’ functions are cascadable, e.g. you can do
    Sphinx::Search->new ->SetMatchMode(SPH_MATCH_ALL) ->SetSortMode(SPH_SORT_RELEVANCE) ->Query("search terms")
  • Sphinx::Search also provides documentation and unit tests, which were the main motivations for branching from the earlier work.

Sphinx has proven to be a far more efficient and better-quality search engine than the built-in MySQL full-text search. It is an order of magnitude faster for large data sets and provides better options for controlling search result relevancy.
