Archive for April, 2009

Perl client for Facebook’s scribe logging software

Scribe is a log aggregator, developed at Facebook and released as open source. Scribe is built on Thrift, a cross-language RPC framework, so it can be used from any of the Thrift-supported languages. Whilst Perl is one of the supported languages, there is little in the way of working examples, so here’s how I did it:

  1. Install Thrift.
  2. Build and install the FB303 Perl modules
      cd thrift/contrib/fb303
      # Edit if/fb303.thrift and add the line 'namespace perl Facebook.FB303' after the other namespace declarations
      thrift --gen perl if/fb303.thrift
      sudo cp -a gen-perl/Facebook /usr/local/lib/perl5/site_perl/5.10.0/ # or wherever you keep your site perl

    This creates the modules Facebook::FB303::Constants, Facebook::FB303::FacebookService and Facebook::FB303::Types.

  3. Install Scribe.
  4. Build and install the Scribe Perl modules
      cd scribe
      # Edit if/scribe.thrift and add 'namespace perl Scribe.Thrift' after the other namespace declarations
      thrift -I /path/to/thrift/contrib/ --gen perl if/scribe.thrift
      sudo cp -a gen-perl/Scribe /usr/local/lib/perl5/site_perl/5.10.0/ # or wherever

    This creates the modules Scribe::Thrift::Constants, Scribe::Thrift::scribe and Scribe::Thrift::Types.
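
Before writing a client, it can be worth confirming that the generated modules are actually visible to Perl. The following is a minimal sketch, assuming the module names listed in steps 2 and 4; `check_modules` is a helper name made up for this illustration, not part of Thrift or Scribe.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Try to load each named module and record whether it could be found on @INC.
# Returns a hash ref of module name => 1 (loaded) or 0 (failed to load).
sub check_modules {
    my @names = @_;
    my %status;
    for my $name (@names) {
        (my $file = "$name.pm") =~ s{::}{/}g;    # Foo::Bar -> Foo/Bar.pm
        $status{$name} = eval { require $file; 1 } ? 1 : 0;
    }
    return \%status;
}

# Module names as given in steps 2 and 4.
my @wanted = qw(
    Facebook::FB303::Types
    Facebook::FB303::FacebookService
    Scribe::Thrift::Types
    Scribe::Thrift::scribe
);

my $status = check_modules(@wanted);
for my $name (@wanted) {
    printf "%-36s %s\n", $name, $status->{$name} ? 'ok' : 'NOT FOUND';
}
```

Any module reported as NOT FOUND usually means the gen-perl output was copied to a directory that is not on @INC.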

      Here is an example program that uses the client (reading one line at a time from stdin and sending to a scribe instance running locally on port 1465):

      #! /usr/bin/perl
      use Scribe::Thrift::scribe;
      use Thrift::Socket;
      use Thrift::FramedTransport;
      use Thrift::BinaryProtocol;
      use strict;
      use warnings;

      my $host = 'localhost';
      my $port = 1465;
      my $cat = $ARGV[0] || 'test';    # log category, taken from the first argument

      my $socket = Thrift::Socket->new($host, $port);
      my $transport = Thrift::FramedTransport->new($socket);
      my $proto = Thrift::BinaryProtocol->new($transport);
      my $client = Scribe::Thrift::scribeClient->new($proto, $proto);
      my $le = Scribe::Thrift::LogEntry->new({ category => $cat });

      $transport->open();
      # Read STDIN explicitly: a bare <> would treat the category argument as a file name
      while (my $line = <STDIN>) {
          $le->message($line);    # set the text of the entry before sending it
          my $result = $client->Log([ $le ]);
          if ($result == Scribe::Thrift::ResultCode::TRY_LATER) {
              print STDERR "TRY_LATER\n";
          }
          elsif ($result != Scribe::Thrift::ResultCode::OK) {
              print STDERR "Unknown result code: $result\n";
          }
      }
      $transport->close();

      UPDATE: Log::Dispatch::Scribe is now available on CPAN and also works with Log::Log4perl. Note, though, that you still need to install the Thrift and Scribe Perl modules as described above.
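
The example program only reports TRY_LATER and moves on to the next line, so that message is lost. One possible way to handle it is to retry with an increasing delay. The following is a sketch only: `send_with_retry` and its options are names invented for this illustration, the numeric constants are assumed to mirror the ResultCode enum in scribe.thrift, and a stub stands in for the real `$client->Log([ $le ])` call so that the script runs without a Scribe server.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed to mirror the ResultCode enum in scribe.thrift (OK, TRY_LATER).
use constant { OK => 0, TRY_LATER => 1 };

# Call $send (a code ref returning a result code) until it returns OK,
# waiting longer after each TRY_LATER. Returns the number of attempts made,
# or 0 if every attempt failed or an unknown result code came back.
sub send_with_retry {
    my ($send, %opt) = @_;
    my $max_tries = $opt{max_tries} // 5;
    my $delay     = $opt{delay}     // 1;                      # seconds
    my $wait      = $opt{wait}      // sub { sleep $_[0] };    # replaceable for testing

    for my $attempt (1 .. $max_tries) {
        my $result = $send->();
        return $attempt if $result == OK;
        return 0        if $result != TRY_LATER;    # unknown code: give up
        if ($attempt < $max_tries) {
            $wait->($delay);
            $delay *= 2;                             # exponential backoff
        }
    }
    return 0;
}

# Stub in place of sub { $client->Log([ $le ]) }: busy twice, then accepts.
my @replies = (TRY_LATER, TRY_LATER, OK);
my $tries = send_with_retry(sub { shift @replies }, delay => 0);
print "delivered after $tries attempt(s)\n";
```

With a real client, the first argument would be `sub { $client->Log([ $le ]) }`, and a return value of 0 would be the point at which to buffer the message or give up.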


Sphinx Search Engine Performance

The following is a summary of some real-world data collected from the Sphinx query logs on a cluster of 15 servers. Each server runs its own copy of Sphinx, Apache, a busy web application, MySQL and miscellaneous services.

The dataset contains 453 million query log instances from 180 Sphinx indexes, collected over several months, using Sphinx version 0.9.8 on Linux kernel 2.6.18. The servers are all Dell PowerEdge 1950 with Quad Core Intel® Xeon® E5335, 2×4MB Cache, 2.0GHz, 1333MHz FSB, SATA drives, 7200rpm.

Keep in mind, though, that this is real world data and not a controlled test. This is how Sphinx performed in our environment, for the particular way we use Sphinx.

The graph below displays the response time distribution for all servers and all indexes. It shows, for example, that 60% of queries complete within 0.01 secs, 80% within 0.1 secs and 99% within 0.5 secs. Response times tend to fall into 3 bands (corresponding to the peaks in the frequency graph): <0.001 secs, 0.03 secs and 0.3 secs, which partly relates to the number of disk accesses required to fulfil a request. At 0.001 secs, all data is in memory, while at 0.3 secs, several disk accesses are occurring. The middle peak is not so obvious in this graph. The per-server and per-index graphs often have different distributions, but they still tend to have peaks at one or more of these three bands.
[Graph: Sphinx Query Response Times, total for all servers, all indexes]
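
For anyone wanting to produce similar figures from their own logs, here is a rough sketch of how the cumulative percentages could be computed. It assumes the plain-text query log line format described in the Sphinx 0.9.x manual. `parse_line` and `cumulative_fractions` are names made up for this illustration, and the sample lines are invented, not taken from the dataset above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one query log line of the assumed form:
#   [date] 0.004 sec [mode/filters/sort total (offset,limit)] [index] query
# Returns (seconds, index name, query text), or nothing if the line does not match.
sub parse_line {
    my ($line) = @_;
    return unless $line =~ /^\[[^\]]*\]\s+(\d+\.\d+)\s+sec\s+\[[^\]]*\]\s+\[([^\]]*)\]\s?(.*)$/;
    return ($1, $2, $3);
}

# Fraction of the values in @$times that are <= each threshold.
sub cumulative_fractions {
    my ($times, @thresholds) = @_;
    my %frac;
    for my $limit (@thresholds) {
        my $within = grep { $_ <= $limit } @$times;
        $frac{$limit} = @$times ? $within / @$times : 0;
    }
    return \%frac;
}

# Invented sample lines, for illustration only.
my @sample = (
    '[Fri Jun 29 21:17:58 2007] 0.004 sec [all/0/rel 35254 (0,20)] [lj] test',
    '[Fri Jun 29 21:18:02 2007] 0.001 sec [all/0/rel 12 (0,20)] [lj] perl thrift',
    '[Fri Jun 29 21:18:05 2007] 0.031 sec [all/0/rel 840 (0,20)] [lj] log aggregator',
    '[Fri Jun 29 21:18:09 2007] 0.290 sec [all/0/rel 99120 (0,20)] [lj] search engine performance',
);

my @times;
for my $line (@sample) {
    my ($secs) = parse_line($line) or next;    # skip lines that do not match
    push @times, $secs;
}

my @limits = (0.01, 0.1, 0.5);                 # thresholds quoted in the text
my $frac = cumulative_fractions(\@times, @limits);
for my $limit (@limits) {
    printf "<= %.2f sec: %5.1f%%\n", $limit, 100 * $frac->{$limit};
}
```

To run it against a real log, replace the loop over `@sample` with a loop over `<STDIN>`.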

The next observation is that query word count affects performance, but not necessarily in proportion to the number of query words, as shown in the graph below. Queries of 1-4 words consistently offer the best performance. The 6-50 word range is consistently the slowest, most likely because the chance of finding documents with multiple matches is high, so there is extra ranking effort involved. Above 50 words, there is presumably a higher chance of having words with few matches, which speeds up the ranking process.
[Graph: Sphinx Query Response Time by Query Word Count]
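
A grouping like this could be produced along the following lines. This is a sketch under stated assumptions: words are counted by splitting on whitespace, which treats operators and quoted phrases as ordinary words and is not necessarily how Sphinx itself counts keywords. The '5' bucket is added only to fill the gap between the ranges mentioned in the text, and both function names are invented for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Map a query string to one of the word-count ranges.
# Assumption: a "word" is any run of non-whitespace characters.
sub word_bucket {
    my ($query) = @_;
    my $n = () = $query =~ /\S+/g;    # count the matches
    return $n == 0  ? '0'
         : $n <= 4  ? '1-4'
         : $n == 5  ? '5'
         : $n <= 50 ? '6-50'
         :            '>50';
}

# Mean response time per bucket, given a list of [seconds, query] pairs.
sub mean_by_bucket {
    my (@rows) = @_;
    my (%sum, %count);
    for my $row (@rows) {
        my ($secs, $query) = @$row;
        my $bucket = word_bucket($query);
        $sum{$bucket} += $secs;
        $count{$bucket}++;
    }
    my %mean;
    $mean{$_} = $sum{$_} / $count{$_} for keys %count;
    return \%mean;
}

# Invented sample rows, for illustration only.
my $mean = mean_by_bucket(
    [0.002, 'sphinx'],
    [0.004, 'log aggregator'],
    [0.300, 'one two three four five six seven eight'],
);
printf "%-5s %.3f sec\n", $_, $mean->{$_} for sort keys %$mean;
```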

Finally, we see that the size of the inverted index (.spd files) also affects performance. The three graphs below show how the response time distribution tends to move to the right as the index size increases. The larger the index, the higher the chance that data will need to be re-read from disk (rather than from Sphinx-internal or system buffers/cache), hence this is not unexpected.
[Graph: Sphinx Query Response Times for Index Sizes 1MB - 3MB]
[Graph: Sphinx Query Response Times for Index Sizes 3MB - 30MB]
[Graph: Sphinx Query Response Times for Index Sizes >30MB]
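
To assign an index to one of these size bands, the size of its .spd file can be read from disk. The sketch below makes two assumptions: 1MB is taken as 1024 × 1024 bytes, and the '<1MB' band is added only for completeness since it is not graphed above. `size_band` is a name invented for this illustration, and the .spd paths are expected on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $MB = 1024 * 1024;    # assumption: binary megabytes

# Map an inverted index size in bytes to one of the size bands.
sub size_band {
    my ($bytes) = @_;
    return $bytes <  1 * $MB  ? '<1MB'
         : $bytes <  3 * $MB  ? '1MB - 3MB'
         : $bytes <= 30 * $MB ? '3MB - 30MB'
         :                      '>30MB';
}

# Usage: pass the paths of the .spd files as arguments.
for my $spd (@ARGV) {
    my $bytes = -s $spd;             # file size in bytes, undef if missing
    next unless defined $bytes;
    printf "%s\t%s\n", $spd, size_band($bytes);
}
```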

Here is a PDF summary of Sphinx performance for this dataset, including many additional graphs of the data by server and by index.