Archive for Perl

www.multitouchanalytics.com updates

Have updated the documentation pages on Multitouch Analytics which will hopefully make the installation process a bit clearer…

Comments

Perl client for Facebook’s scribe logging software

Scribe is a log aggregator, developed at Facebook and released as open source. Scribe is built on Thrift, a cross-language RPC type platform, and therefore it is possible to use scribe with any of the Thrift-supported languages. Whilst Perl is one of the supported languages, there is little in the way of working examples, so here’s how I did it:

  1. Install Thrift.
  2. Build and install FB303 perl modules
      cd thrift/contrib/fb303
      # Edit if/fb303.thrift and add the line 'namespace perl Facebook.FB303' after the other namespace declarations
      thrift --gen perl if/fb303.thrift
      sudo cp -a gen-perl/ /usr/local/lib/perl5/site_perl/5.10.0 # or wherever you keep your site perl
    

    This creates the modules Facebook::FB303::Constants, Facebook::FB303::FacebookService and Facebook::FB303::Types.

  3. Install Scribe.
  4. Build and install Scribe perl modules
      cd scribe
      # Edit if/scribe.thrift and add 'namespace perl Scribe.Thrift' after the other namespace declarations
      thrift -I /path/to/thrift/contrib/ --gen perl scribe.thrift
      sudo cp -a gen-perl/Scribe /usr/local/lib/perl5/site_perl/5.10.0/ # or wherever
    
  5. This creates the modules Scribe::Thrift::Constants, Scribe::Thrift::scribe, Scribe::Thrift::Types.

      Here is an example program that uses the client (reading one line at a time from stdin and sending to a scribe instance running locally on port 1465):

      #! /usr/bin/perl
      
      use Scribe::Thrift::scribe;
      use Thrift::Socket;
      use Thrift::FramedTransport;
      use Thrift::BinaryProtocol;
      use strict;
      use warnings;
      
      my $host = 'localhost';
      my $port = 1465;
      my $cat = $ARGV[0] || 'test';
      
      my $socket = Thrift::Socket->new($host, $port);
      my $transport = Thrift::FramedTransport->new($socket);
      my $proto = Thrift::BinaryProtocol->new($transport);
      
      my $client = Scribe::Thrift::scribeClient->new($proto, $proto);
      my $le = Scribe::Thrift::LogEntry->new({ category => $cat });
      
      $transport->open();
      
      while (my $line = <>) {
          $le->message($line);
          my $result = $client->Log([ $le ]);
          if ($result == Scribe::Thrift::ResultCode::TRY_LATER) {
      	print STDERR "TRY_LATER\n";
          }
          elsif ($result != Scribe::Thrift::ResultCode::OK) {
      	print STDERR "Unknown result code: $result\n";
          }
      }
      
      $transport->close();
      

      UPDATE Log::Dispatch::Scribe is now available on CPAN. Also works with Log::Log4perl. Note though, you still need to install Thrift and Scribe perl modules as described above.

Comments

MySQL – Many-row SELECT Performance – “OR” bad, “IN” good

Consider the situation where you have a list of row IDs and you need to retrieve the data for each of the rows.  The simplest way is to make one query per row, i.e.

(A) SELECT * from data_table WHERE id=?

For a large number of rows, that results in a lot of queries.  This could be condensed into one query, such as:

(B) SELECT * from data_table WHERE id=1 OR id=2 OR id=3 …

or

(C) SELECT * from data_table WHERE id IN (1,2,3,…)

When constructing potentially large SQL statements such as these (imagine if you wanted to retrieve 1,000,000 rows), it’s important to take into account the max_allowed_packet size which restricts the length of the query.  It might be necessary to divide the data up into several blocks and make a query for each block to ensure max_allowed_packet is not exceeded.

Another approach is to create a temporary table, insert the keys of the required rows, then do a JOIN query to retrieve the data, i.e.

(D) CREATE TEMPORARY TABLE tmp ( id INT(11) );

INSERT INTO tmp (id) VALUES (1), (2), (3), …

SELECT d.* FROM data_table d JOIN tmp USING (id)

This approach is somewhat cleaner, particularly when multiple keys are involved.  With multiple keys the WHERE syntax of the prior options becomes:

WHERE (key1=x1 AND key2=y1) OR (key1=x2 AND key2=y2) …

or

WHERE (key1, key2) IN ((x1, y1), (x2, y2), …)

Under the temporary table approach, the question then arises as to how to most efficiently insert the data. A ‘LOAD DATA INFILE’ approach is the most efficient way to load a table, but here we assume this is not an option as it is not readily portable (due to security settings that differ between local and remote MySQL daemons).  The example (D) above assumes a long INSERT statement, which again may be affected by max_allowed_packet.  Other options include:

(E) Multiple single INSERTs, INSERT INTO tmp (id) VALUE (?)

(F) Multiple single INSERTs in a transaction block, begin_work .. commit

(G) Multiple single INSERTs as an array, using the DBI execute_array() function

(H) As for (G), in a transaction block.

These options were benchmarked using MySQL 5.0.45 and the results are shown in the figure below.  As would be expected, the use of single select statements scales linearly.  For small query set sizes, the setup times for the different query approaches have significant impact on the performance; as the query set size increases, three classes emerge – one group that performs similarly to single selects, another that performs much much better, and one that lives on a completely different planet (one you wouldn’t want to visit).  In summary:

  • That SELECT + IN(…) (case C) offers best performance when the query set size is above 30 or so.  It is also interesting to note that the performance of SELECT + IN(…) is very similar to using a temporary table with a single, long INSERT statement for large query set sizes, presumably because internally the IN(…) operation is essentially implemented as a temporary table.
  • That SELECT + OR (case B) is a good choice for query set size < 30
  • That SELECT + OR hits a point where performance becomes exponentially worse (not shown on the graph, for the largest data set the performance reaches 1300s per query set!  Curiously, this is elapsed time, but CPU time does not significantly increase. This suggest there are some inefficient data moves/swapping occurring).

In short, as a rule of thumb, use SELECT + OR for query sets < 30 in size, and SELECT + IN(…) otherwise.

The SELECT + OR performance is a significant result; the Perl SQL::Abstract library turns a WHERE specification such as { A => [ 1, 2, 3] } into  WHERE ( ( ( A = ? ) OR ( A = ? ) OR ( A = ? ) ) ).  It will do the same if there are 1000 options (try it – perl -MSQL::Abstract -e ‘$sql = SQL::Abstract->new; $w = $sql->where({ A => [ 1 .. 1000]}); print $w’).  Thus libraries that use SQL::Abstract, such as DBIx::Class, are similarly affected.  A perfectly reasonable approach from the library’s perspective, but potentially a significant performance hit if used in this manner.

Feel free to review my benchmarking code and tell me if I’ve got it wrong…

UPDATE Nov 19 2008:  There is a sequel post that looks at SELECT … UNION and using a temporary table with an index.

Comments (1)

Integrating Sphinx into Perl Applications

Sphinx is a full-text search engine (http://www.sphinxsearch.com) designed
primarily for full-text search of database content.  It has many features but in
my opinion its best assets are speed of search and scalability.

We started using Sphinx when MySQL built-in full-text search was becoming too
slow and too CPU intensive, and of questionable accuracy.  Sphinx is lightning
fast compared to MySQL and provides better results relevancy.

This note is about integration with the standalone Sphinx search server. Sphinx
also has a component (‘SphinxSE’) that runs as a MySQL 5 engine so can be used as
a direct replacement for MySQL full-text search; to use SphinxSE, standard Perl
DBI should be all that is necessary.

What you will need:

The following CPAN modules are likely to be useful:

Sphinx::Search
Sphinx::Manager
Sphinx::Config

Sphinx::Manager provides facilities to start and stop the search server and to
run the indexer.

Sphinx::Search provides the search API.

Sphinx::Config allows you to read/write the Sphinx configuration files from
code, in case you wish to maintain the configuration elsewhere (e.g. in your
database).

Putting it all together:

Running the Sphinx searchd server

Sphinx operates most efficiently if it is allowed to run persistently as a
background service.  Theoretically, you could start the Sphinx server, do a
search and then stop it on every request, with a small amount of overhead – but
here we will consider just the typical case.

Ideally you will use your operating system tools start such as daemontools,
monit or just the SysV startup scripts to start and monitor searchd, rather than
have to worry about it in your perl app.  But, if you need or want to start it
in perl:

  use Sphinx::Manager;
  my $mgr = Sphinx::Manager->new({ config_file => ’/etc/sphinx.conf’ });
  $mgr->start_searchd;

You should verify that the effective UID of your perl app has all of the appropriate
permissions:

  • to create and write to the PID file (see ’searchd’ section of config, ‘pid_file’)
  • to create and write to the log file (see ’searchd’/'log’)
  • to read the Sphinx database files (‘path’ in each of your ‘index’ specifications)

Adding Content to the Index

  use Sphinx::Manager;
  my $mgr = Sphinx::Manager->new({ config_file => ’/etc/sphinx.conf’ });
  $mgr->run_indexer('--rotate');

Sphinx gets its content for indexing directly from the database, according to
the ’sql_query’ given in the config file.  ‘run_indexer’ simply runs the command
line version of the Sphinx indexer program.  You can pass any indexer arguments
through to ‘run_indexer’; ‘–rotate’ is typical, to force searchd to start using
the newly created index without disrupting searches while indexing is
occurring.

Searching

Make sure you have a version of Sphinx::Search that is compatible with searchd.
A compatibility list is given at the top of the Sphinx::Search perldoc.
Hopefully a point will be reached where the Sphinx::Search client can support a
range of searchd versions, but for the moment that is impractical.

Sphinx::Search can be used with any logging object that supports error, warn,
info and debug methods.  In this example I have used Log::Log4perl.

  use Sphinx::Search;
  use Log::Log4perl qw(:easy);
  Log::Log4perl->easy_init($DEBUG);
  $sph = Sphinx::Search->new( log => Log::Log4perl->get_logger('sphinx.search') );
  my $results = $sph->setMatchMode(SPH_MATCH_ALL)
                    ->Query("...");

Configuring

Sphinx::Config provides the tools to read and write the Sphinx configuration file.

A typical problem is that searchd is running on a non-standard port (the default
is 3312), so how will your perl app know where to find it?  Obviously you don’t
want to hard-code port numbers in case they change…

use Sphinx::Search;
use Sphinx::Config;
use Log::Log4perl qw(:easy);

Log::Log4perl->easy_init($DEBUG);

$sph = Sphinx::Search->new( log => Log::Log4perl->get_logger(’sphinx.search’) );

# Get port from config file
$conf = Sphinx::Config->new;
$conf->parse(‘/etc/sphinx.conf’);
my $port = $conf->get(’searchd’, undef, ‘port’);

# Tell Sphinx client
$sph->setServer(‘localhost’, $port);

my $results = $sph->Query(“…”);


Enjoy

We have had a considerable amount of success using Perl and Sphinx.  I hope you
do too.


Comments

Adding Action Timings to your Catalyst Output

About a year ago, onemogin wrote an article on adding action timings to the HTML output of a Catalyst app. To do so, it was necessary to access $c->stats, which at the time was an internal object (that is, there was no published API for it) and therefore subject to change. As of Catalyst-Runtime 5.7012, $c->stats has a defined interface and returns a Catalyst::Stats object (or your own class, if you provide one) rather than the Tree::Simple object that it used to.

It’s easy to fix your code to work with 5.7012. Onemogin’s code in the end() method looked like this:

  my $tree = $c->stats();

  my $dvisit = new Tree::Simple::Visitor();
  $tree->accept($dvisit);
  $c->stash->{'action_stats'} = $dvisit->getResults();

which needs to become this:

  my @report = $c->stats->report;
  $c->stash->{action_stats}= \@report;

and your template will also need to change; here’s an example:

 
 <div id="stats">
 <table border="0" cellspacing="0" cellpadding="0">
 [% space = '&nbsp;&nbsp;' %]
 <tr><th>Action</th><th>Time</th></tr>
 [% FOREACH r=action_stats %]
 <tr><td class="description">[% space.repeat(r.0) %][% r.1 | html %]</td>
<td class="elapsed">[% UNLESS r.3 %]+[% END %][% r.2 %]s</td></tr>
 [% END %]
 </table>
 </div>

to produce an end result such as:

ActionTime
/default0.005895s
  -> /look_left0.00091s
    - starting critical bit+0.000479s
    - critical bit complete+0.000208s
  -> /look_right0.000587s
  -> /look_left0.000799s
    - starting critical bit+0.000441s
    - critical bit complete+0.000169s
  -> /cross_over0.001766s
/end0.000462s

Here’s the bit of controller code that generated the example:

sub default : Private {
    my ( $self, $c ) = @_;

    $c->forward('look_left');
    $c->forward('look_right');
    $c->forward('look_left');
    $c->forward('cross_over');
}

sub look_left : Private {
    my ( $self, $c ) = @_;
    for (1 .. 100) {};
    $c->stats->profile("starting critical bit");
    for (1 .. 100) {};
    $c->stats->profile("critical bit complete");
}

sub look_right : Private {
    for (1 .. 1000) {};
}
sub cross_over : Private {
    for (1 .. 10000) {};
}

sub end : ActionClass('RenderView') {
    my ( $self, $c ) = @_;
    my @report = $c->stats->report;
    $c->stash->{action_stats}= \@report;
}

Comments (1)

Unicode Character Classes

These are the Unicode “General Category” character class names used in regular expression matching, e.g. in Perl, \pP or \p{Punctuation} to match all Unicode characters having the “punctuation” property.

Expression Syntax Long Name Description
Letter :L Letter Matches any letter, Ll | Lm | Lo | Lt | Lu
Uppercase letter :Lu Uppercase_Letter Matches any one capital letter. For example, :Luhe matches “The” but not “the”.
Lowercase letter :Ll Lowercase_Letter Matches any one lower case letter. For example, :Llhe matches “the” but not “The”.
Title case letter :Lt Titlecase_Letter Matches characters that combine an uppercase letter with a lowercase letter, such as Nj and Dz.
Modifier letter :Lm Modifier_Letter Matches letters or punctuation, such as commas, cross accents, and double prime, used to indicate modifications to the preceding letter.
Other letter :Lo Other_Letter Matches other letters, such as gothic letter ahsa.
Cased letter :LC Cased_Letter Matches any letter with case, Ll | Lt | Lu
Mark :M Mark Matches any mark, Mc | Me | Mn
Non-spacing mark :Mn Nonspacing_Mark Matches non-spacing marks.
Combining mark :Mc Spacing_Mark Matches combining marks.
Enclosing mark :Me Enclosing_Mark Matches enclosing marks.
Number :N Number Matches any number, Nd | Nl | No
Decimal digit :Nd Decimal_Number Matches decimal digits such as 0-9 and their full-width equivalents.
Letter digit :Nl Letter_Number Matches letter digits such as roman numerals and ideographic number zero.
Other digit :No Other_Number Matches other digits such as old italic number one.
Punctuation :P Punctuation Matches any puncutation, Pc | Pd | Pe | Pf | Pi | Po | Ps
Connector punctuation :P c Connector_Punctuation Matches the underscore or underline mark.
Dash punctuation :P d Dash_Punctuation Matches the dash mark.
Open punctuation :P s Open_Punctuation Matches opening punctuation such as open brackets and braces.
Close punctuation :P e Close_Punctuation Matches closing punctuation such as closing brackets and braces.
Initial quote punctuation :P i Initial_Punctuation Matches initial double quotation marks.
Final quote punctuation :P f Final_Punctuation Matches single quotation marks and ending double quotation marks.
Other punctuation :P o Other_Punctuation Matches commas (,), ?, “, !, @, #, %, &, *, \, colons (:), semi-colons (;), ‘, and /.
Symbol :S Symbol Matches any symbol, Sc | Sk | Sm | So
Math symbol :Sm Math_Symbol Matches +, =, ~, |, <, and >.
Currency symbol :Sc Currency_Symbol Matches $ and other currency symbols.
Modifier symbol :Sk Modifier_Symbol Matches modifier symbols such as circumflex accent, grave accent, and macron.
Other symbol :So Other_Symbol Matches other symbols, such as the copyright sign, pilcrow sign, and the degree sign.
Separator :Z Separator Matches any separator, Zl | Zp | Zs
Paragraph separator :Zp Paragraph_Separator Matches the Unicode character U+2029.
Space separator :Zs Space_Separator Matches blanks.
Line separator :Zl Line_Separator Matches the Unicode character U+2028.
Other control :Cc Control Matches end of line.
Other format :Cf Format Formatting control character such as the bidirectional control characters.
Surrogate :Cs Surrogate Matches one half of a surrogate pair.
Other private-use :Co Private_Use Matches any character from the private-use area.
Other not assigned :Cn Unassigned Matches characters that do not map to a Unicode character.

References:

unicode.org

Unicode Character Properties

Unicode Regular Expressions

Unicode Property Aliases 

Perl Regular Expressions

PCRE

Comments