Sphinx Search Engine Comparative Benchmarks

Benchmark Overview

This study is a comparative benchmark of Sphinx searchd performance, particularly looking at the effect of distributing an index across multiple CPU cores and/or multiple CPUs.

Absolute performance is highly dependent on the data and query set – the number of documents, the size of each document, the number of query terms, and the relative distribution of words in the documents and in the queries. To establish comparative performance, this study uses one set of data comprising 1.9M records in a total of 629MB, and a set of 10,000 multi-term queries relevant to the data set, averaging 3.1 words per query. For the test which compares performance on indexes of various sizes, the original data set is decimated by a factor to make smaller data sets with similar characteristics, or duplicated multiple times with different document identifiers to make larger data sets. In these cases the number of results and number of documents to be returned will be proportional to the data set size, but in all other respects the characteristics of the data set will remain constant.

The benchmark is based on Sphinx 1.11-dev (r2672) (February 2011), match mode ‘MATCH_ANY’, sort mode ‘RELEVANCE’, limited to 1000 results per query, returning the first 100. The set of 10,000 queries contained 6770 unique queries.
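
For illustration, this is roughly how each benchmark query could be issued from Perl – a minimal sketch assuming the Sphinx::Search CPAN client rather than the actual benchmark harness, with the host, port, index name and query text as placeholders:

#! /usr/bin/perl
use strict;
use warnings;
use Sphinx::Search;

my $sph = Sphinx::Search->new();
$sph->SetServer('localhost', 9312);      # searchd host/port (placeholders)
$sph->SetMatchMode(SPH_MATCH_ANY);       # match mode used in this study
$sph->SetSortMode(SPH_SORT_RELEVANCE);   # sort by relevance
$sph->SetLimits(0, 100, 1000);           # return the first 100 of up to 1000 results

my $results = $sph->Query('three word query', 'dist1');  # 'dist1' is a placeholder
printf "%d of %d matches in %ss\n",
    scalar @{ $results->{matches} }, $results->{total_found}, $results->{time}
    if $results;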

Comparison of worker models

Sphinx provides four multiprocessing (MPM) worker models – ‘fork’, ‘prefork’, ‘threads’ and ‘none’. In the ‘fork’ model, a new process is forked for each query as it arrives. In ‘prefork’, a set number of processes are forked when searchd starts, and each picks up a new query when it is free to do so and there are queries waiting, thereby avoiding the forking overhead. The ‘threads’ model is like ‘fork’ but uses threads instead of processes, which carries less overhead. Under the ‘none’ model, a single process handles all queries sequentially.

In the following benchmarks, the query rate (requests per second) is measured over 10,000 standard queries with the index split into up to 8 parts and queried as a distributed index, for each of the MPM worker models, with query loads of 1, 2, 4 or 8 parallel clients. In the case of 1 client, all 10,000 queries are sent sequentially at maximum rate; in the case of 2 clients, the first client sends the first 5,000 queries and the second sends the second 5,000 queries in parallel at maximum rate, and so on. The distributed index is queried by setting the first index as a ‘local’ index so it is searched by the process that received the query, and the remaining indexes are set as ‘agents’ such that the first process sends the queries off to the agents, searches its local index, then collects the responses from the agents and merges the results.
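
By way of illustration (a sketch only, not the actual benchmark configuration – the index names, host and port are placeholders), such a distributed index might be declared in sphinx.conf along these lines:

searchd
{
    # one of: fork, prefork, threads, none
    workers = threads
    listen  = 9312
}

index dist1
{
    type  = distributed
    # searched in-process by the worker that received the query
    local = part1
    # searched via a network query back to this same searchd instance
    agent = localhost:9312:part2
    agent = localhost:9312:part3
    agent = localhost:9312:part4
}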

The benchmarks have been performed on a server known as ‘titan3’, an 8-core (strictly, 4 cores with hyper-threading) server with 16GB RAM running Fedora 12.

Fork Worker

Prefork Worker

Threads Worker

None Worker

Each of the graphs shows the trade-off between splitting into more indexes, which gives faster, parallelised search, and the overhead of merging the results from those indexes, together with the extra contention for available threads as the number of indexes and the number of clients increase.

Data for the ‘none’ worker is provided just for reference; it is not a model that one would choose to use in normal circumstances as there is no gain to be made through parallelisation and longer queries block shorter queries since all are handled sequentially. The test with the ‘none’ worker had to be performed using the ‘local’ index model as the ‘agent’ model does not function correctly where only one process is available. As expected there is a gradual decline in performance as the number of indexes increases, but in fact the performance is relatively constant compared to the variation seen for the other models.

Excluding the ‘none’ worker model, performance is broadly similar across the models, with the exception of the prefork model under high load, where apparently the number of preforked processes is insufficient. Otherwise, the threaded worker is slightly faster than the fork worker, and the prefork worker sometimes has a small advantage over the fork worker. The threaded worker’s margin over the fork worker grows as the number of indexes increases.

Thus the prefork worker has little to recommend it, offering negligible gain over fork but a potentially large loss under high load. Whilst the threads worker has the best performance, it also carries the risk that if a bug in searchd causes the process to crash, all services are lost, whereas with the fork worker only the crashed process is lost and searchd otherwise continues to receive queries and fork new processes.

Hence the threaded worker is the best choice for performance, but for the risk-averse the fork worker provides reliable services at only a small loss of performance.

Peak performance occurs at 4 indexes for 1-2 query clients and 2 indexes for 4-8 query clients. On a 4-core CPU (below), the peaks occur at 4 indexes for 1 client, 3 for 2 clients, 2 for 4 clients and 1 for 8 clients, illustrating the trade-off between available cores and demand for parallel processes (number of clients × number of indexes). Best performance appears to occur when the demand is between 1-2 times the number of cores.

Fork Worker (4-core server)

Recommendations

  • Use the threads worker (highest performance, medium risk) or the fork worker (high performance, low risk) depending on risk tolerance
  • Choose number of indexes such that (number of clients times number of indexes) is between 1-2 times the number of available cores

Use of dist_threads and local

Distinct from the threaded worker option described above, Sphinx 1.11-dev also offers a ‘dist_threads’ option which specifies the number of threads for handling local indexes. The previous benchmarks specified the first index as ‘local’ and the remaining indexes as ‘agent’ in the distributed index, so that the process receiving the query would search the first index, send requests to each of the other agent indexes, await the results and then merge them. Where the agent index is on the same server, this causes a network query back to the same searchd instance, which then forks another process to handle each of the agents. With ‘dist_threads’, all of the local index searches are handled by local threads, thereby avoiding the network query and MPM worker overheads.
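
The all-local arrangement can be sketched as follows (again illustrative only – the index names and thread count are placeholders):

searchd
{
    workers      = threads
    # threads available for searching the local parts of a distributed index
    dist_threads = 4
}

index dist1
{
    type  = distributed
    # all parts 'local': searched in parallel by dist_threads threads,
    # with no network query back to searchd and no extra MPM workers
    local = part1
    local = part2
    local = part3
    local = part4
}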

Effect of dist_threads

The above figure shows the variation in query rate with the number of threads allocated using the ‘dist_threads’ configuration option with all indexes specified as ‘local’. The comparison is based on 4 query clients.

The case of 1 thread is equivalent to searching each of the indexes sequentially. Notably, the performance is almost constant as the number of indexes increases, although there is a small decline in performance as would be expected due to the overhead of merging the results from each index.

Having more threads than indexes is no advantage; the results for 3 indexes at 4 and 3 threads, 2 indexes at 4, 3 and 2 threads, and 1 index at 4, 3, 2 and 1 threads are all coincident.

A periodicity can be observed in the cases of 2, 3 and 4 threads, with a period of 2, 3 and 4 indexes respectively. When there are 2 threads to search 2 indexes, both searches finish in roughly equal time, or more generally if there are >= N threads to search N indexes, each thread handles no more than one index and the N used threads finish at about the same time. If there are 2 threads to search 3 indexes, then one thread must search 2 indexes so takes twice as long as the other, so that thread governs the time. As shown by the line where ‘threads=indexes’, optimal performance is achieved when all threads finish in the same time, which requires that the number of threads is equal to or at least a factor of the number of indexes.
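
A rough model captures this behaviour. Assuming equal-sized indexes, free cores and ignoring the merge step, with K threads over N indexes each thread searches ceil(N/K) indexes in turn, so the total search time is about

    T_total ≈ ceil(N / K) × T_idx

where T_idx is the time to search one index. For K = 2 and N = 3, ceil(3/2) = 2, so 3 indexes cost as much as 4; the cost only steps down again when N is a multiple of K, which produces the observed periodicity.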

Rather than having a situation where each thread would handle, for example, two indexes, in practice it makes sense to set the number of threads equal to the number of indexes and let the operating system manage the scheduling. It is hard to conceive of a practical situation where setting fewer threads would be an advantage.

Recommendations

  • Use dist_threads = number of local indexes
  • Never have a non-integer ratio of indexes to threads

‘local’ and ‘agent’

Effect of ‘local’ option

Per the discussion above, with all indexes specified as ‘local’ (and without ‘dist_threads’), the search is performed sequentially, so there is no performance gain as the number of indexes increases, compared to using ‘agent’ where the query can be parallelised by searchd querying itself. The above figure shows that when the number of indexes reaches 6, the network and forking overhead of the ‘agent’ approach equals the overhead of sequential search, so beyond 6 indexes the agent approach performs considerably worse than sequential search. Using dist_threads removes most of the overhead, and performance is maintained as the number of indexes increases.

Recommendations

  • Use ‘local’ and ‘dist_threads’ in preference to ‘agent’

Comparison of multi-core and fully distributed performance

Whilst using a multi-core CPU provides parallel CPU capability, disk, memory and peripheral I/O can still present bottlenecks. The following benchmark compares distributing indexes over several independent servers against a multi-core CPU.

The first 3 servers used here are identical 4-core machines – titan1, titan2 and titan4; the next server is titan3 which performs typically 70-90% better than the others; then salmoneus and atlas, which are marginally slower than titan3; then finally sisyphus, which is slowest of all. The relative benchmarks for each server based on 4 indexes, 4 clients, fork worker are as follows:

titan1/2/4:   75 req/s
titan3:      133 req/s
salmoneus:   106 req/s
atlas:       112 req/s
sisyphus:     21 req/s

Note that as well as providing extra I/O bandwidth, the addition of a CPU also adds to the pool of available CPU cores. For example, a test using 4 indexes and 4 clients has a demand for up to 16 threads, so performance on a single 4 core machine is expected to be worse than on two 4 core machines with 2 indexes each since the latter has a total of 8 cores available to service the 16 threads.

Multi-CPU

There is a rapid drop-off in performance in going from 6 to 7 indexes with 1 index/CPU, which may be explained by the introduction of the slowest server as the seventh index. The slowest server will govern overall performance. However, the drop-off from 8 to 10 indexes with 2 indexes/CPU is more difficult to explain as the extra server in this case (salmoneus) is quite capable; this test was checked and repeated several times with consistent results. The overhead of collecting and merging results appears to have had a step increase at over 8 indexes, which is also consistent with the results for 4 indexes/CPU.

The above figure shows that a distributed index across multiple servers has significantly better performance and scalability than across multiple cores. This leads to the question as to whether disk I/O is the bottleneck. However, as shown in the figure below, where the indexes were distributed across 4 disks in the one 8-core server, there is no material gain in performance by using multiple disks. It is likely that due to the large amount of RAM in this server available for caching, the disks are relatively infrequently accessed and the test is effectively CPU-bound, so the number of available cores is the dominant factor.

Multi-disk Comparison

Comparison of performance for various sized indexes

The following data sets were created from the original 629MB set:

  • 40MB by decimation of original data set by 16
  • 160MB by decimation of original data set by 4
  • 1.3GB by duplicating original data set twice with new document IDs
  • 2.5GB by duplicating original data set four times with new document IDs
  • 10GB by duplicating original data set 16 times with new document IDs

Each data set was split into 1 to 8 indexes and queried with 4 clients.

The graphs below show the same data in three different ways; the first shows the log-log relationship between request rate and data size, which is more clearly shown in the second graph as a roughly linear relationship between average query time and data size, regardless of the number of indexes. The third graph shows the variation in query rate with number of indexes for each data set, showing how the smaller data sets are more sensitive to the number of indexes since the overhead of communication is more significant relative to the query time in these cases.

Consistent with the earlier results, these show that with 4 query clients, 2-4 indexes is the optimum number, independent of data size.

The linear relationship between query time and data size therefore suggests that for N indexes with no contention (disk/threads/memory etc), the search time can be reduced to 1/N of the single-index time, plus roughly N times the merging overhead. From the earlier discussion on the availability of CPU cores to service the thread demand, it is likely that the apparent optimum of 2-4 indexes is driven by CPU core contention, and at higher numbers of indexes the merging overheads also contribute. So in the fully distributed case where core contention is less of an issue, performance may be increased by merging results hierarchically, so that the merging overhead reduces from N × Tm to Tm × log N in the ideal case (where Tm is the time to merge the results for each index).

The following figure compares the situation of 2 cores/CPU distributed across multiple servers with 4 query clients, in the one case merging all results on the first server, and in the second case merging the results hierarchically to 2 levels (by specifying a distributed index on each server that queries the two local indexes). Clearly the performance is significantly improved at 10 indexes and above.
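
The two-level arrangement can be sketched as follows (server names, port and index names are placeholders): the top-level distributed index queries one distributed index per server, and each of those merges its own two local parts before replying:

# on the top-level server
index dist_top
{
    type  = distributed
    local = part1
    local = part2
    agent = server2:9312:dist_sub2
    agent = server3:9312:dist_sub3
}

# on server2 (server3 is analogous)
index dist_sub2
{
    type  = distributed
    local = part3
    local = part4
}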

Comparison of Distributed Index with Hierarchical Organisation

Comparison of Query Modes

Given that performance appears to be significantly governed by the number of results returned, it would be expected that the MATCH_ALL query mode, which generates far fewer results, would be faster than the MATCH_ANY query mode. This is demonstrated in the following graph:

Comparison of Query Modes

Comparison of Search Results

When the indexes are distributed, a limited set of results is retrieved from each index, partly ranked and then centrally merged, whereas in the single index situation the ranking occurs with a full view of the data set. This can potentially lead to differences between results. To evaluate this, the sets of results for the 6770 unique queries in the test set were saved to disk from a single index and from a 4-part distributed index, and then compared. The measures of comparison were:

  • Total results match – the number of times that the same number of results were returned from both sources (most of the time 100 results would have been returned, except when few matches were found).
  • Total found match – the number of times that the reported number of total documents found was the same from both sources.
  • Top 10 match – the number of times that the top 10 documents contained the same set (not necessarily in the same order).
  • Doc set match – the number of times that both sources returned the same set of results (not necessarily in the same order).
  • Doc groups match – Each document is returned with a weight, which may be an integer, and for many searches the weight will be the same for many documents; sorting by weight within each group of equally-weighted documents is therefore arbitrary. Also, if there are more than 100 results, the set of documents selected for inclusion in the top 100 from the equally-weighted group at the cut-off point is arbitrary. The doc groups match ignores the group of lowest-weighted documents in the set and measures the degree of match in the remaining sets.
  • Doc mismatch – the number of documents in the first set of results that are not found in the second set.
  • Max mismatch – the maximum number of documents that are mismatched between the two sources for any query.
  • Median mismatch – the median number of documents that are mismatched across all queries (excluding those that have no mismatches).
  • Mismatch distribution – the number of times there were 1, 2, 3, up to 10 mismatched documents found.

The following are the results for single and four-part indexes, with match modes of ‘ANY’ and ‘ALL’.

Single Index vs Four Part Index (Sphinx 1.11-dev, MATCH_ANY)

 Total results match           : 6770 / 6770 (100.00%)
 Total found match             : 6586 / 6770 (97.28%)
 Top 10 match                  : 6374 / 6770 (94.15%)
 Doc set match                 : 5563 / 6770 (82.17%)
 Doc groups match              : 5389 / 5993 (89.92%)
 Doc mismatch                  : 4532 / 673551 (0.67%)
 Max mismatch                  : 90 (604.00%)
 Median mismatch               : 1.5 (1.50%)
 Mismatch distribution         : 1-288 2-97 3-37 4-20 5-12 6-7 7-9 8-10 9-7 10-3

Single Index vs Four Part Index (Sphinx 1.11-dev, MATCH_ALL)

 Total results match           : 6735 / 6770 (99.48%)
 Total found match             : 6642 / 6770 (98.11%)
 Top 10 match                  : 6641 / 6770 (98.09%)
 Doc set match                 : 6666 / 6770 (98.46%)
 Doc groups match              : 3872 / 3981 (97.26%)
 Doc mismatch                  : 1951 / 363309 (0.54%)
 Max mismatch                  : 95 (109.00%)
 Median mismatch               : 4 (4.00%)
 Mismatch distribution         : 1-15 2-8 3-9 4-11 5-1 6-2 7-6 8-1 9-1 10-3

Notably there are some cases where the different approaches return a different number of results.

In considering the upgrade from Sphinx 0.9.9 to 1.11-dev, it is also of interest to compare the difference between the two versions:

Sphinx 0.9.9 vs 1.11-dev (Single Index, MATCH_ANY)

 Total results match           : 6770 / 6770 (100.00%)
 Total found match             : 6769 / 6770 (99.99%)
 Top 10 match                  : 6535 / 6770 (96.53%)
 Doc set match                 : 5907 / 6770 (87.25%)
 Doc groups match              : 5760 / 5993 (96.11%)
 Doc mismatch                  : 6365 / 673551 (0.94%)
 Max mismatch                  : 99 (233.00%)
 Median mismatch               : 1 (1.00%)
 Mismatch distribution         : 1-92 2-19 3-7 4-3 5-1 6-2 7-0 8-1 9-2 10-2

Sphinx 0.9.9 vs 1.11-dev (Single Index, MATCH_ALL)

 Total results match           : 6770 / 6770 (100.00%)
 Total found match             : 6769 / 6770 (99.99%)
 Top 10 match                  : 6706 / 6770 (99.05%)
 Doc set match                 : 6707 / 6770 (99.07%)
 Doc groups match              : 3888 / 3981 (97.66%)
 Doc mismatch                  : 2436 / 363309 (0.67%)
 Max mismatch                  : 99 (93.00%)
 Median mismatch               : 10 (10.00%)
 Mismatch distribution         : 1-15 2-7 3-5 4-4 5-2 6-5 7-0 8-2 9-2 10-3

Summary

There is a big advantage in almost every situation to distributing the Sphinx index into two parts on a multi-core or multi-CPU server or servers. For light loads and with plenty of spare parallel CPU capacity, splitting into more parts may give a further advantage, while going above 8 parts can cause performance degradation. In general, 2-4 indexes provide the best performance in most situations. These recommendations assume a minimum of a 4-core server.


www.multitouchanalytics.com updates

Have updated the documentation pages on Multitouch Analytics which will hopefully make the installation process a bit clearer…


Multitouch Analytics

Building on some of the ideas in the SEOMoz article How To Get Past Last-Touch Attribution With Google Analytics, MultiTouch Analytics offers a way to record and report on all of the marketing channels that contribute to any given sale. It provides a javascript library to record the touches in a cookie, and a separate reporting package to generate reports that show the contributions to each conversion from all the marketing channels.



Scribe Server Configuration

Due to reconfiguration at Sourceforge, the Scribe Configuration page is no longer there.  This is a copy of the page http://scribeserver.wiki.sourceforge.net/Configuration rescued from Google’s cache.

Configuring Scribe

The Scribe Server can be configured by:

  1. the file specified in the -c command line option
  2. the file at DEFAULT_CONF_FILE_LOCATION in env_default.h

Global Configuration Variables

port: assigned to variable “port”

  • what port the scribe server will listen on
  • default 0, passed at command line with -p, can also be set in conf file

max_msg_per_second: assigned to variable “maxMsgPerSecond”

  • used in scribeHandler::throttleDeny
  • default 100,000

max_queue_size: in bytes, assigned to variable “maxQueueSize”

  • used in scribeHandler::Log
  • default 500,000 bytes

check_interval: in seconds, assigned to variable “checkPeriod”

  • used to control how often to check each store
  • default 5

new_thread_per_category: boolean, assigned to variable “newThreadPerCategory”

  • If true, will create a new thread for every category seen. Otherwise, will only create a single thread for every store defined in the configuration.
  • default true

Example:

port=1463
max_msg_per_second=2000000
max_queue_size=10000000
check_interval=3

Store Configuration

Scribe Server determines how to log messages based on the Stores defined in the configuration. Every store must specify what message category it handles with two exceptions:
default store: The ‘default’ category handles any category that is not handled by any other store. There can only be one default store.

  • category=default

prefix stores: If the specified category ends in a *, the store will handle all categories that begin with the specified prefix.

  • category=web*

Store Configuration Variables

category: Determines which messages are handled by this store
type:

  • Currently used types (defined in Store::createStore)
    1. file
    2. buffer
    3. network
    4. bucket
    5. thriftfile
    6. null
    7. multi

target_write_size: 16,384 bytes by default

  • determines how large to let the message queue grow for a given category before processing the messages

max_write_interval: 10 seconds by default

  • determines how long to let the messages queue for a given category before processing the messages

Example:

<store>
category=statistics
type=file
target_write_size=20480
max_write_interval=2

...

</store>

File Store Configuration

File Stores write messages to a file.

file_path: defaults to “/tmp”
base_filename: defaults to category name
rotate_period: “hourly”, “daily”, or “never”; “never” by default

  • determines how often to create new files

rotate_hour: 0-23, 1 by default

  • if rotate_period is daily, determines what hour of day to rotate

rotate_minute: 0-59, 15 by default

  • if rotate_period is daily or hourly, determines how many minutes after the hour to rotate

max_size: 1,000,000,000 bytes by default

  • determines approximately how large to let a file grow before rotating to a new file

write_meta: “yes” or anything else; false by default

  • whether to log the following metadata in each file:
      1. the length of each message is prepended to the message as an unsigned integer
      2. if the file was rotated, the last line will contain “scribe_meta<new_logfile>: ” followed by the next filename

fs_type: currently only “std” is supported; “std” by default
chunk_size: 0 by default

  • if a chunk size is specified, no messages within the file will cross chunk boundaries unless there are messages larger than the chunk size

add_newlines: 0 or 1, 0 by default

  • if set to 1, will write a newline after every message

create_symlink: “yes” or anything else; “yes” by default

  • if true, will maintain a symlink that points to the most recently written file

Example:

<store>
category=sprockets
type=file
file_path=/tmp/sprockets
base_filename=sprockets_log
max_size=1000000
add_newlines=1
rotate_period=daily
rotate_hour=0
rotate_minute=10
</store>

Network Store Configuration

Network Stores forward messages to other Scribe Servers.

remote_host: name or ip of remote host to forward messages
remote_port: port number on remote host
timeout: socket timeout, in ms; defaults to DEFAULT_SOCKET_TIMEOUT_MS, which is set to 5000 in store.h
use_conn_pool: “yes” or anything else; defaults to false

  • whether to use connection pooling instead of opening up multiple connections to each remote host

Example:

<store>
category=default
type=network
remote_host=hal
remote_port=1465
</store>

Buffer Store Configuration

Buffer Stores must have two sub-stores named “primary” and “secondary”. Buffer Stores will first attempt to Log messages to the primary store and only log to the secondary if the primary is not available. Once the primary store comes back online, a Buffer store will read messages out of the secondary store and send them to the primary store. Only stores that are readable (stores that implement the readOldest() method) may be used as the secondary store. Currently, the only readable stores are File Stores and Null Stores.

max_queue_length: 2,000,000 messages by default

  • if the number of messages in the queue exceeds this value, the buffer store will switch to writing to the secondary store

buffer_send_rate: 1 by default

  • determines, for each check_interval, how many times to read a group of messages from the secondary store and send them to the primary store

retry_interval: 300 seconds by default

  • how long to wait to retry sending to the primary store after failing to write to the primary store

retry_interval_range: 60 seconds by default

  • will randomly pick a retry interval that is within this range of the specified retry_interval

Example:

<store>
category=default
type=buffer
buffer_send_rate=1
retry_interval=30
retry_interval_range=10

<primary>
type=network
remote_host=wopr
remote_port=1456
</primary>

<secondary>
type=file
file_path=/tmp
base_filename=thisisoverwritten
max_size=10000000
</secondary>
</store>

Note! When the network connection is re-established, the messages from the secondary store are sent one whole file at a time. Thus max_size determines not only the size of the file that triggers rotation, but also the size of the network messages; if this is too large, the receiver may not be able to handle it. Best to keep it to a number that can be comfortably handled in memory. max_size does not limit the total number of messages that can be buffered (presumably that’s limited by the amount of space available on the filesystem).

Bucket Store Configuration

Bucket Stores will hash messages to multiple files using a prefix of each message as the key. A Bucket Store must have a substore named “bucket” that is either a File Store or ThriftFile Store.

num_buckets: defaults to 1

  • number of buckets to hash into
  • messages that cannot be hashed into any bucket will be put into a special bucket number 0

bucket_type: “key_hash” or “key_modulo”
delimiter: must be an ASCII code between 0 and 255; if 0, uses DEFAULT_DELIMITER (set in common.h)

  • The message prefix up to (but not including) the first occurrence of the delimiter will be used as the key for the hash/modulo

bucket_subdir: the name of each subdirectory will be this name followed by the bucket number
Example:

<store>
category=bucket_me
type=bucket
num_buckets=5
bucket_subdir=bucket
bucket_type=key_hash
delimiter=58

<bucket>
type=file
fs_type=std
file_path=/tmp/scribetest
base_filename=bucket_me
</bucket>
</store>

Null Store Configuration

Null Stores can be used to tell Scribe to ignore all messages of a given category.

(no configuration parameters)
Example:

<store>
category=tps_report*
type=null
</store>

Multi Store Configuration

A Multi Store is a store that will forward all messages to multiple sub-stores.

A Multi Store may have any number of substores named “store0”, “store1”, “store2”, etc.
report_success: “all” or “any”, defaults to “all”

  • whether all substores or any substores must succeed in logging a message in order for the Multi Store to report the message logging as successful

Example:

<store>
category=default
type=multi
target_write_size=20480
max_write_interval=1

<store0>
type=file
file_path=/tmp/store0
</store0>

<store1>
type=file
file_path=/tmp/store1
</store1>
</store>

Thriftfile Store Configuration

A Thriftfile store is similar to a File store except that it stores messages in a Thrift TFileTransport file.

file_path: defaults to “/tmp”
base_filename: defaults to category name
rotate_period: “hourly”, “daily”, or “never”; “never” by default

  • determines how often to create new files

rotate_hour: 0-23, 1 by default

  • if rotate_period is daily, determines what hour of day to rotate

rotate_minute: 0-59, 15 by default

  • if rotate_period is daily or hourly, determines how many minutes after the hour to rotate

max_size: 1,000,000,000 bytes by default

  • determines approximately how large to let a file grow before rotating to a new file

write_meta: “yes” or anything else; false by default

  • whether to log the following metadata in each file:
      1. the length of each message is prepended to the message as an unsigned integer
      2. if the file was rotated, the last line will contain “scribe_meta<new_logfile>: ” followed by the next filename

fs_type: currently only “std” is supported; “std” by default
chunk_size: 0 by default

  • if a chunk size is specified, no messages within the file will cross chunk boundaries unless there are messages larger than the chunk size

create_symlink: “yes” or anything else; “yes” by default

  • if true, will maintain a symlink that points to the most recently written file

flush_frequency_ms: milliseconds, will use TFileTransport default of 3000ms if not specified

  • determines how frequently to sync the Thrift file to disk

msg_buffer_size: in bytes, will use TFileTransport default of 0 if not specified

  • if non-zero, store will reject any writes larger than this size

Example:

<store>
category=sprockets
type=thriftfile
file_path=/tmp/sprockets
base_filename=sprockets_log
max_size=1000000
flush_frequency_ms=2000
</store>


Perl client for Facebook’s scribe logging software

Scribe is a log aggregator, developed at Facebook and released as open source. Scribe is built on Thrift, a cross-language RPC type platform, and therefore it is possible to use scribe with any of the Thrift-supported languages. Whilst Perl is one of the supported languages, there is little in the way of working examples, so here’s how I did it:

  1. Install Thrift.
  2. Build and install FB303 perl modules
      cd thrift/contrib/fb303
      # Edit if/fb303.thrift and add the line 'namespace perl Facebook.FB303' after the other namespace declarations
      thrift --gen perl if/fb303.thrift
      sudo cp -a gen-perl/ /usr/local/lib/perl5/site_perl/5.10.0 # or wherever you keep your site perl
    

    This creates the modules Facebook::FB303::Constants, Facebook::FB303::FacebookService and Facebook::FB303::Types.

  3. Install Scribe.
  4. Build and install Scribe perl modules
      cd scribe
      # Edit if/scribe.thrift and add 'namespace perl Scribe.Thrift' after the other namespace declarations
      thrift -I /path/to/thrift/contrib/ --gen perl scribe.thrift
      sudo cp -a gen-perl/Scribe /usr/local/lib/perl5/site_perl/5.10.0/ # or wherever
    
    This creates the modules Scribe::Thrift::Constants, Scribe::Thrift::scribe and Scribe::Thrift::Types.

    Here is an example program that uses the client (reading one line at a time from stdin and sending to a scribe instance running locally on port 1465):

      #! /usr/bin/perl
      
      use Scribe::Thrift::scribe;
      use Thrift::Socket;
      use Thrift::FramedTransport;
      use Thrift::BinaryProtocol;
      use strict;
      use warnings;
      
      my $host = 'localhost';
      my $port = 1465;
      my $cat = $ARGV[0] || 'test';
      
      my $socket = Thrift::Socket->new($host, $port);
      my $transport = Thrift::FramedTransport->new($socket);
      my $proto = Thrift::BinaryProtocol->new($transport);
      
      my $client = Scribe::Thrift::scribeClient->new($proto, $proto);
      my $le = Scribe::Thrift::LogEntry->new({ category => $cat });
      
      $transport->open();
      
      while (my $line = <>) {
          $le->message($line);
          my $result = $client->Log([ $le ]);
          if ($result == Scribe::Thrift::ResultCode::TRY_LATER) {
      	print STDERR "TRY_LATER\n";
          }
          elsif ($result != Scribe::Thrift::ResultCode::OK) {
      	print STDERR "Unknown result code: $result\n";
          }
      }
      
      $transport->close();
      

      UPDATE: Log::Dispatch::Scribe is now available on CPAN, and also works with Log::Log4perl. Note, though, that you still need to install the Thrift and Scribe perl modules as described above.


Sphinx Search Engine Performance

The following is a summary of some real-world data collected from the Sphinx query logs on a cluster of 15 servers. Each server runs its own copy of Sphinx, Apache, a busy web application, MySQL and miscellaneous services.

The dataset contains 453 million query log instances from 180 Sphinx indexes, collected over several months, using Sphinx version 0.9.8 on Linux kernel 2.6.18. The servers are all Dell PowerEdge 1950 with Quad Core Intel® Xeon® E5335, 2×4MB Cache, 2.0GHz, 1333MHz FSB, SATA drives, 7200rpm.

Keep in mind, though, that this is real world data and not a controlled test. This is how Sphinx performed in our environment, for the particular way we use Sphinx.

The graph below displays the response time distribution for all servers and all indexes, and shows, for example, that 60% of queries complete within 0.01 secs, 80% within 0.1 secs and 99% within 0.5 secs. Response times tend to fall in 3 bands (corresponding to the peaks in the frequency graph) – <0.001 secs, 0.03 secs and 0.3 secs – which partly relates to the number of disk accesses required to fulfil a request. At 0.001 secs, all data is in memory, while at 0.3 secs, several disk accesses are occurring. Whilst the middle peak is not so obvious in this graph, the per-server and per-index graphs often have different distributions but still tend to have peaks at one or more of these three bands.
Sphinx Query Response Times – Total for all servers, all indexes

The next observation is that query word count affects performance, but not necessarily in proportion to the number of query words, as shown in the graph below. 1-4 word queries consistently offer the best performance. The 6-50 word range is consistently the slowest, most likely because the chance of finding documents with multiple matches is high, so there is extra ranking effort involved. Above 50 words, there is presumably a higher chance of having words with few matches, which speeds up the ranking process.
Sphinx Query Response Time by Query Word Count

Finally, we see that the size of the inverted index (.spd files) also affects performance. The three graphs below show how the response time distribution tends to move to the right as the index size increases. The larger the index, the higher the chance that data will need to be re-read from disk (rather than from Sphinx-internal or system buffers/cache), hence this is not unexpected.
Sphinx Query Response Times for Index Sizes 1MB - 3MB
Sphinx Query Response Times for Index Sizes 3MB - 30MB
Sphinx Query Response Times for Index Sizes >30MB

Here is a PDF summary of Sphinx performance for this dataset, including many additional graphs of the data by server and by index.


MySQL Multi-Select Performance – The Sequel

Following my original post, it was suggested to me that one of the following may give better performance:

  • SELECT … UNION SELECT …
  • Using a temporary table with an index.

Well, not so.  I have added the above cases to my benchmarking script, and updated the graph as shown below.

SELECT … UNION gave all sorts of problems.  Firstly, it broke at a query set size of 1000 with the error

Can't open file: './bench/test1.frm' (errno: 24)

After a bit of searching I found that the remedy for this was to increase the MySQL open_files_limit setting (was 1024, increased to 8192).  This got it going again, only to fall over once more at a query set size of 10000, this time with the error

parser stack overflow near 'UNION SELECT ...

to which I could not find a solution.  In any case, the performance as shown in the graph is closely tracking the exponential degradation of the SELECT + OR case.  Conclusion: SELECT UNIONs are not suited for a large number of unions.  Useful when merging the results of several different SELECT statements, though.

The addition of an index to the temporary table also had no appreciable effect in this test, probably because MySQL will use the index in the main table to search while scanning through the temporary table.  Perhaps there might be an improvement for the case where the temporary table is larger than the main table – but that would imply duplicates in the temporary table.


MySQL – Many-row SELECT Performance – “OR” bad, “IN” good

Consider the situation where you have a list of row IDs and you need to retrieve the data for each of the rows.  The simplest way is to make one query per row, i.e.

(A) SELECT * from data_table WHERE id=?

For a large number of rows, that results in a lot of queries.  This could be condensed into one query, such as:

(B) SELECT * from data_table WHERE id=1 OR id=2 OR id=3 …

or

(C) SELECT * from data_table WHERE id IN (1,2,3,…)

When constructing potentially large SQL statements such as these (imagine if you wanted to retrieve 1,000,000 rows), it’s important to take into account the max_allowed_packet size which restricts the length of the query.  It might be necessary to divide the data up into several blocks and make a query for each block to ensure max_allowed_packet is not exceeded.
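
As a sketch of that blocking approach (illustrative only – the connection details, table name and block size are placeholders, and this is not the benchmark script itself), option (C) can be applied in blocks from Perl/DBI:

#! /usr/bin/perl
use strict;
use warnings;
use DBI;
use List::Util qw(min);

my $dbh = DBI->connect('dbi:mysql:database=bench', 'user', 'pass',
                       { RaiseError => 1 });

my @ids   = (1 .. 1_000_000);  # the rows we want (illustrative)
my $block = 10_000;            # sized so each query stays well under max_allowed_packet

my @rows;
for (my $i = 0; $i < @ids; $i += $block) {
    my @chunk        = @ids[ $i .. min($i + $block - 1, $#ids) ];
    my $placeholders = join ',', ('?') x @chunk;
    my $sth = $dbh->prepare("SELECT * FROM data_table WHERE id IN ($placeholders)");
    $sth->execute(@chunk);
    push @rows, @{ $sth->fetchall_arrayref({}) };  # accumulate rows as hashrefs
}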

Another approach is to create a temporary table, insert the keys of the required rows, then do a JOIN query to retrieve the data, i.e.

(D) CREATE TEMPORARY TABLE tmp ( id INT(11) );

INSERT INTO tmp (id) VALUES (1), (2), (3), …

SELECT d.* FROM data_table d JOIN tmp USING (id)

This approach is somewhat cleaner, particularly when multiple keys are involved.  With multiple keys the WHERE syntax of the prior options becomes:

WHERE (key1=x1 AND key2=y1) OR (key1=x2 AND key2=y2) …

or

WHERE (key1, key2) IN ((x1, y1), (x2, y2), …)

Under the temporary table approach, the question then arises as to how to most efficiently insert the data. A ‘LOAD DATA INFILE’ approach is the most efficient way to load a table, but here we assume this is not an option as it is not readily portable (due to security settings that differ between local and remote MySQL daemons).  The example (D) above assumes a long INSERT statement, which again may be affected by max_allowed_packet.  Other options include:

(E) Multiple single INSERTs, INSERT INTO tmp (id) VALUE (?)

(F) Multiple single INSERTs in a transaction block, begin_work .. commit

(G) Multiple single INSERTs as an array, using the DBI execute_array() function

(H) As for (G), in a transaction block (see the sketch below).
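
A sketch of (G)/(H) using DBI (illustrative only – the connection details and key list are placeholders; the actual benchmarking code is linked below):

#! /usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=bench', 'user', 'pass',
                       { RaiseError => 1, AutoCommit => 1 });
my @ids = (1 .. 100_000);  # illustrative key list

$dbh->do('CREATE TEMPORARY TABLE tmp ( id INT(11) )');

$dbh->begin_work;          # (H): wrap the inserts in a transaction block
my $sth = $dbh->prepare('INSERT INTO tmp (id) VALUES (?)');
$sth->execute_array({}, \@ids);  # (G): DBI executes once per array element
$dbh->commit;

my $rows = $dbh->selectall_arrayref(
    'SELECT d.* FROM data_table d JOIN tmp USING (id)', { Slice => {} });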

These options were benchmarked using MySQL 5.0.45 and the results are shown in the figure below.  As would be expected, the use of single select statements scales linearly.  For small query set sizes, the setup times for the different query approaches have significant impact on the performance; as the query set size increases, three classes emerge – one group that performs similarly to single selects, another that performs much much better, and one that lives on a completely different planet (one you wouldn’t want to visit).  In summary:

  • That SELECT + IN(…) (case C) offers best performance when the query set size is above 30 or so.  It is also interesting to note that the performance of SELECT + IN(…) is very similar to using a temporary table with a single, long INSERT statement for large query set sizes, presumably because internally the IN(…) operation is essentially implemented as a temporary table.
  • That SELECT + OR (case B) is a good choice for query set size < 30
  • That SELECT + OR hits a point where performance becomes exponentially worse (not shown on the graph; for the largest data set the performance reaches 1300s per query set! Curiously, this is elapsed time, while CPU time does not significantly increase, which suggests there are some inefficient data moves/swapping occurring).

In short, as a rule of thumb, use SELECT + OR for query sets < 30 in size, and SELECT + IN(…) otherwise.

The SELECT + OR performance is a significant result; the Perl SQL::Abstract library turns a WHERE specification such as { A => [ 1, 2, 3] } into WHERE ( ( ( A = ? ) OR ( A = ? ) OR ( A = ? ) ) ). It will do the same if there are 1000 options (try it: perl -MSQL::Abstract -e '$sql = SQL::Abstract->new; $w = $sql->where({ A => [ 1 .. 1000]}); print $w'). Thus libraries that use SQL::Abstract, such as DBIx::Class, are similarly affected. A perfectly reasonable approach from the library’s perspective, but potentially a significant performance hit if used in this manner.

Feel free to review my benchmarking code and tell me if I’ve got it wrong…

UPDATE Nov 19 2008:  There is a sequel post that looks at SELECT … UNION and using a temporary table with an index.

