Integrating Sphinx into Perl Applications

Sphinx is a full-text search engine (http://www.sphinxsearch.com) designed
primarily for indexing and searching database content.  It has many features,
but in my opinion its best assets are search speed and scalability.

We started using Sphinx when MySQL's built-in full-text search was becoming too
slow, too CPU intensive, and of questionable accuracy.  Sphinx is lightning
fast compared to MySQL and provides better result relevancy.

This note is about integration with the standalone Sphinx search server.  Sphinx
also has a component (‘SphinxSE’) that runs as a MySQL 5 engine and so can be
used as a direct replacement for MySQL full-text search; to use SphinxSE,
standard Perl DBI should be all that is necessary.
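
For completeness, here is a minimal sketch of what a SphinxSE query might look
like through plain DBI.  The DSN, the credentials and the ‘search_index’ table
are hypothetical; the table would be created with ENGINE=SPHINX against your
searchd instance:

  use DBI;

  # Hypothetical DSN and credentials.
  my $dbh = DBI->connect('dbi:mysql:database=mydb', 'user', 'password',
                         { RaiseError => 1 });

  # SphinxSE passes the search query (and options such as the match
  # mode) through the 'query' column in the WHERE clause.
  my $sth = $dbh->prepare(
      q{SELECT id, weight FROM search_index WHERE query = ?});
  $sth->execute('search terms;mode=all');

  while (my ($id, $weight) = $sth->fetchrow_array) {
      print "doc $id, weight $weight\n";
  }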

What you will need:

The following CPAN modules are likely to be useful:

Sphinx::Search
Sphinx::Manager
Sphinx::Config

Sphinx::Manager provides facilities to start and stop the search server and to
run the indexer.

Sphinx::Search provides the search API.

Sphinx::Config allows you to read/write the Sphinx configuration files from
code, in case you wish to maintain the configuration elsewhere (e.g. in your
database).

Putting it all together:

Running the Sphinx searchd server

Sphinx operates most efficiently if it is allowed to run persistently as a
background service.  Theoretically, you could start the Sphinx server, do a
search and then stop it on every request, with a small amount of overhead – but
here we will consider just the typical case.

Ideally you will use operating system tools such as daemontools, monit or plain
SysV init scripts to start and monitor searchd, rather than have to worry about
it in your Perl app.  But if you need or want to start it from Perl:

  use Sphinx::Manager;
  my $mgr = Sphinx::Manager->new({ config_file => '/etc/sphinx.conf' });
  $mgr->start_searchd;

You should verify that the effective UID of your Perl app has all of the
appropriate permissions (a quick sanity-check sketch follows the list):

  • to create and write to the PID file (see the 'searchd' section of the config, 'pid_file')
  • to create and write to the log file (see 'searchd'/'log')
  • to read the Sphinx database files ('path' in each of your 'index' specifications)
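
One way to check these from Perl, as a sketch only, using Sphinx::Config
(described below) and assuming a conventional config layout:

  use Sphinx::Config;
  use File::Basename qw(dirname);

  my $conf = Sphinx::Config->new;
  $conf->parse('/etc/sphinx.conf');

  # The pid file and log file must be writable (or creatable) by the
  # effective UID.
  for my $key (qw(pid_file log)) {
      my $path = $conf->get('searchd', undef, $key) or next;
      warn "$key ($path) is not writable\n"
          unless -e $path ? -w $path : -w dirname($path);
  }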

Adding Content to the Index

  use Sphinx::Manager;
  my $mgr = Sphinx::Manager->new({ config_file => '/etc/sphinx.conf' });
  $mgr->run_indexer('--rotate');

Sphinx gets its content for indexing directly from the database, according to
the 'sql_query' given in the config file.  'run_indexer' simply runs the
command-line version of the Sphinx indexer program.  You can pass any indexer
arguments through to 'run_indexer'; '--rotate' is typical, to force searchd to
start using the newly created index without disrupting searches while indexing
is occurring.

Searching

Make sure you have a version of Sphinx::Search that is compatible with searchd.
A compatibility list is given at the top of the Sphinx::Search perldoc.
Hopefully a point will be reached where the Sphinx::Search client can support a
range of searchd versions, but for the moment that is impractical.

Sphinx::Search can be used with any logging object that supports error, warn,
info and debug methods.  In this example I have used Log::Log4perl.

  use Sphinx::Search;
  use Log::Log4perl qw(:easy);
  Log::Log4perl->easy_init($DEBUG);
  my $sph = Sphinx::Search->new( log => Log::Log4perl->get_logger('sphinx.search') );
  my $results = $sph->SetMatchMode(SPH_MATCH_ALL)
                    ->Query("...");
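
On success, Query returns a hash reference.  A minimal sketch of consuming it
(the field names follow the Sphinx::Search result structure; fetching the full
rows from your own database is left to you):

  if ($results) {
      printf "%d of %d matches in %ss\n",
          scalar @{ $results->{matches} },
          $results->{total_found},
          $results->{time};
      for my $match (@{ $results->{matches} }) {
          # 'doc' is the document ID from your sql_query; use it to look
          # up the full record for display.
          printf "doc=%d weight=%d\n", $match->{doc}, $match->{weight};
      }
  }
  else {
      warn "Search failed: " . $sph->GetLastError . "\n";
  }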

Configuring

Sphinx::Config provides the tools to read and write the Sphinx configuration file.

A typical problem is that searchd is running on a non-standard port (the default
is 3312), so how will your Perl app know where to find it?  Obviously you don’t
want to hard-code port numbers in case they change…

  use Sphinx::Search;
  use Sphinx::Config;
  use Log::Log4perl qw(:easy);

  Log::Log4perl->easy_init($DEBUG);

  my $sph = Sphinx::Search->new( log => Log::Log4perl->get_logger('sphinx.search') );

  # Get port from config file
  my $conf = Sphinx::Config->new;
  $conf->parse('/etc/sphinx.conf');
  my $port = $conf->get('searchd', undef, 'port');

  # Tell Sphinx client
  $sph->SetServer('localhost', $port);

  my $results = $sph->Query("...");


Enjoy

We have had a considerable amount of success using Perl and Sphinx.  I hope you
do too.



Adding Action Timings to your Catalyst Output

About a year ago, onemogin wrote an article on adding action timings to the HTML output of a Catalyst app. To do so, it was necessary to access $c->stats, which at the time was an internal object (that is, there was no published API for it) and therefore subject to change. As of Catalyst-Runtime 5.7012, $c->stats has a defined interface and returns a Catalyst::Stats object (or your own class, if you provide one) rather than the Tree::Simple object that it used to.

It’s easy to fix your code to work with 5.7012. Onemogin’s code in the end() method looked like this:

  my $tree = $c->stats();

  my $dvisit = new Tree::Simple::Visitor();
  $tree->accept($dvisit);
  $c->stash->{'action_stats'} = $dvisit->getResults();

which needs to become this:

  my @report = $c->stats->report;
  $c->stash->{action_stats} = \@report;

and your template will also need to change; here’s an example:

 
  <div id="stats">
  <table border="0" cellspacing="0" cellpadding="0">
  [% space = '&nbsp;&nbsp;' %]
  <tr><th>Action</th><th>Time</th></tr>
  [% FOREACH r=action_stats %]
  <tr><td class="description">[% space.repeat(r.0) %][% r.1 | html %]</td>
  <td class="elapsed">[% UNLESS r.3 %]+[% END %][% r.2 %]s</td></tr>
  [% END %]
  </table>
  </div>

to produce an end result such as:

  Action                        Time
  /default                      0.005895s
    -> /look_left               0.00091s
      - starting critical bit   +0.000479s
      - critical bit complete   +0.000208s
    -> /look_right              0.000587s
    -> /look_left               0.000799s
      - starting critical bit   +0.000441s
      - critical bit complete   +0.000169s
    -> /cross_over              0.001766s
  /end                          0.000462s

Here’s the bit of controller code that generated the example:

sub default : Private {
    my ( $self, $c ) = @_;

    $c->forward('look_left');
    $c->forward('look_right');
    $c->forward('look_left');
    $c->forward('cross_over');
}

sub look_left : Private {
    my ( $self, $c ) = @_;
    for (1 .. 100) {};
    $c->stats->profile("starting critical bit");
    for (1 .. 100) {};
    $c->stats->profile("critical bit complete");
}

sub look_right : Private {
    for (1 .. 1000) {};
}

sub cross_over : Private {
    for (1 .. 10000) {};
}

sub end : ActionClass('RenderView') {
    my ( $self, $c ) = @_;
    my @report = $c->stats->report;
    $c->stash->{action_stats} = \@report;
}


MSIE Cookies Bite Back!

Here we are in 2008. We build computers with RAM measured in GB and disk in TB. I just discovered (the hard way) that Microsoft Internet Explorer can only handle 4096 bytes of cookies for a page in JavaScript. Total. Not each. Total.

Worse, if the cookies on your page exceed this limit and you try to read the cookies using document.cookie, you don’t just get some of the cookies or a set that is truncated to 4096 bytes; you get NOTHING.

From the Microsoft Knowledge Base: “For one domain name, each cookie is limited to 4,096 bytes. This total can exist as one name-value pair of 4 kilobytes (KB) or as up to 20 name-value pairs that total 4 KB. … If you use the document.cookie property to retrieve the cookie on the client side, the document.cookie property can retrieve only 4,096 bytes. This byte total can be one name-value pair of 4 KB, or it can be up to 20 name-value pairs that have a total size of 4 KB.”

Stack that up against RFC 2965, which says:

   ...general-use user agents SHOULD provide each of the following
   minimum capabilities individually, although not necessarily
   simultaneously:

      *  at least 300 cookies

      *  at least 4096 bytes per cookie (as measured by the characters
         that comprise the cookie non-terminal in the syntax description
         of the Set-Cookie2 header, and as received in the Set-Cookie2
         header)

      *  at least 20 cookies per unique host or domain name

   User agents created for specific purposes or for limited-capacity
   devices SHOULD provide at least 20 cookies of 4096 bytes, to ensure
   that the user can interact with a session-based origin server.

According to the references, this problem applies up to MSIE 6.0, but testing shows it is still a problem in IE 7.

Needless to say, this is only a problem in IE.  Firefox and Safari, although they presumably have some limit, do not suffer the same ridiculously small bound.

Test it yourself; here is a simple cookie limit test page containing a script that sets 10 cookies, each of about 72 bytes, printing document.cookie at each iteration. On first visit, the cookies disappear at iteration 6, and on subsequent visits at iteration 1 (until you clear cookies or close your browser).
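
If you want to reproduce the test server-side, here is a rough CGI sketch along the same lines (the cookie names and sizes are arbitrary); it sets 10 cookies of roughly 72 bytes each and then lets the page report what the browser can actually see:

  use strict;
  use warnings;
  use CGI;

  my $q = CGI->new;

  # Ten cookies of ~72 bytes each -- enough to push IE past the
  # 4096-byte limit.
  my @cookies = map {
      $q->cookie(-name => "test$_", -value => 'x' x 64)
  } 1 .. 10;

  print $q->header(-cookie => \@cookies),
        $q->start_html('Cookie limit test'),
        '<script>document.write("Visible: " + document.cookie.length + " bytes")</script>',
        $q->end_html;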

I wonder how many shopping carts this has broken.

References:


Page Load Times and Visitor Abandonment

This is some 2005 data, courtesy http://www.marketingexperiments.com/improving-website-conversion/page-weight.html and http://www.emarketer.com/

Visitor Abandonment

  Page Load Time   Percent of Users Continuing to Wait
  10 seconds       84%
  15 seconds       51%
  20 seconds       26%
  30 seconds       5%

What You Need To UNDERSTAND: You will lose nearly half your visitors if they have to wait longer than 15 seconds for a page to load. Only 5% of visitors will wait longer than 30 seconds.

In 2008, where most of the population is on broadband, I expect that visitors are less patient than ever.

Update:  2006 data, quoted from http://www.avactis.com/forums/index.php?showtopic=238 : “The research shows that four seconds is the maximum length of time an average online shopper will wait for a Web page to load before abandoning one retail site and moving on to another.”

 


MP3 to WAV Conversion on Linux

MP3 to WAV conversion is remarkably simple:

mpg123 -w out.wav in.mp3

For the purpose of writing an audio CD, a sample rate of 44100 Hz and stereo output are essential:

mpg123 --stereo -r 44100 -w out.wav in.mp3
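
To batch-convert a directory of MP3s, a small Perl wrapper does the job (assuming mpg123 is on your PATH; file names are taken from the current directory):

  use strict;
  use warnings;

  for my $mp3 (glob '*.mp3') {
      (my $wav = $mp3) =~ s/\.mp3$/.wav/i;
      system('mpg123', '--stereo', '-r', '44100', '-w', $wav, $mp3) == 0
          or warn "mpg123 failed on $mp3\n";
  }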

And then to write the WAV files to a CD:

cdrecord -audio -pad *.wav


IE Cookie Handling Policies

This is a summary of Internet Explorer settings for handling cookies, under the so-called “Privacy” options; IE6 and IE7 are the same, although some of the wording has changed in the descriptions. It’s important to keep these in mind when issuing cookies. The Wikipedia article on HTTP Cookies outlines some of the alternatives.

  • Block All Cookies
    Blocks all cookies from all web sites from being accepted, and won’t send any existing cookies. Should be renamed “Unusable”.
  • High
    Blocks all cookies from websites that do not carry a compact privacy policy (P3P) and cookies that contain personally identifiable (contact) information. A November 2007 study shows that only about 4% of sites use P3P, so this security setting is almost as unusable as “Block All Cookies”.
  • Medium High
    Same as “High” for 3rd party cookies. Also blocks first party cookies that contain personally identifiable information.
  • Medium
    As above, but rather than “blocking” first party cookies that contain personally identifiable information, it only “restricts” them. Efforts to find the difference between “block” and “restrict” have so far been fruitless. It may mean that cookies are accepted but not sent (how useful would that be?), or that such cookies can only be used in the same web page that created them (i.e. a restriction on the domain/path components of the cookie), or that the cookies are not kept beyond the current session.
  • Low
    Same as “High” for 3rd party cookies. No restrictions on first party cookies.
  • Accept All Cookies
    No restrictions.

In contrast, Safari (Macintosh) allows the simple options: Accept Cookies Always/Never/Only from sites you navigate to (i.e. Always/Never/Only First Party). Firefox by default allows all cookies except where specific exceptions have been defined. There do not seem to be any Firefox extensions which emulate the IE or Safari behaviour – which perhaps places into perspective the real threat that third party cookies are(n’t) in general.

References

P3P Usage Survey


Sphinx::Search 0.08 released to CPAN

I have just uploaded to CPAN the latest version of Sphinx::Search, the Perl API for the Sphinx Search Engine.

Search for Sphinx::Search on CPAN to get the latest.

Version 0.08 is suitable for Sphinx 0.9.8-svn-r871 and later (currently r909). This version fixes a couple of bugs related to error checking.

I have been asked a few times what makes Sphinx::Search different from the Perl API that comes bundled in the contrib directory of the Sphinx distribution. The bundled Sphinx.pm was used as the starting point of Sphinx::Search. Maintenance of that version appears to have lapsed at sphinx-0.9.7, so many of the newer API calls are not available there. Sphinx::Search is mostly compatible with the old Sphinx.pm except:

  • On failure, Sphinx::Search returns undef rather than 0 or -1.
  • Sphinx::Search 'Set' functions are cascadable, e.g. you can do
    Sphinx::Search->new->SetMatchMode(SPH_MATCH_ALL)->SetSortMode(SPH_SORT_RELEVANCE)->Query("search terms")
  • Sphinx::Search also provides documentation and unit tests, which were the main motivations for branching from the earlier work.

Sphinx has proven to be far more efficient, and a better quality search engine, than the built-in MySQL full-text search. It is an order of magnitude faster for large data sets and provides better options for controlling search result relevancy.


Google Searches per Day

As of August 2007, Google is handling 1200 Million searches per day on average worldwide, according to a Clickz article reporting on Comscore data. Yahoo is a long way behind at 275 Million, and MSN at 70 Million. Baidu (a Chinese search engine) beats MSN, coming in at 105 Million.

2006 figures for the US only put Google at 91 Million searches per day.

In June 2007, Wikipedia received an average of 55.6 Million referrals per day from Google. Guessing that each Google search results in 2 click-throughs on average, that means Wikipedia is getting about 2% of Google’s organic traffic. Wikipedia is ranked #8 in Alexa’s traffic rankings, behind Yahoo, Google, YouTube, Live & MSN, Myspace and Facebook. Given that these probably get most of their traffic directly rather than via search engines, Wikipedia may be the most-referred site from Google.

Alexa’s top 50 is dominated by search engines and various forms of social networking sites, with Ebay, Amazon and a few porn sites thrown in.

Google lists its most popular search terms on a daily basis in its Hot Trends area. Leading up to Thanksgiving in the US, about 10% of the top 100 search terms contained the word “turkey” – turkey cooking time, turkey recipe, how long to cook a turkey, roasting turkey, turkey thermometer, turkey soup, and more. About 40% of the most popular queries related to Thanksgiving in some way.

References:

Clickz article: Worldwide Internet: Now Serving 61 Billion Searches per Month

Alexa Top 500
Wikimedia page views

Search Engine Watch 2006 data


Unicode Character Classes

These are the Unicode “General Category” character class names used in regular expression matching, e.g. in Perl, \pP or \p{Punctuation} to match all Unicode characters having the “punctuation” property.
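
For example, a small sketch (the sample string is arbitrary):

  use strict;
  use warnings;
  use utf8;
  binmode STDOUT, ':encoding(UTF-8)';

  my $text = 'Résumé costs €9.99!';

  my @letters  = $text =~ /(\p{L})/g;    # every letter, including é
  my @currency = $text =~ /(\p{Sc})/g;   # currency symbols: €
  (my $clean = $text) =~ s/\p{P}//g;     # strip all punctuation

  print scalar(@letters), " letters\n";  # 11 letters
  print "currency: @currency\n";         # currency: €
  print "stripped: $clean\n";            # stripped: Résumé costs €999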

Expression                Syntax  Long Name              Description
Letter                    :L      Letter                 Matches any letter, Ll | Lm | Lo | Lt | Lu.
Uppercase letter          :Lu     Uppercase_Letter       Matches any one capital letter. For example, :Luhe matches “The” but not “the”.
Lowercase letter          :Ll     Lowercase_Letter       Matches any one lower case letter. For example, :Llhe matches “the” but not “The”.
Title case letter         :Lt     Titlecase_Letter       Matches characters that combine an uppercase letter with a lowercase letter, such as Nj and Dz.
Modifier letter           :Lm     Modifier_Letter        Matches letters or punctuation, such as commas, cross accents, and double prime, used to indicate modifications to the preceding letter.
Other letter              :Lo     Other_Letter           Matches other letters, such as gothic letter ahsa.
Cased letter              :LC     Cased_Letter           Matches any letter with case, Ll | Lt | Lu.
Mark                      :M      Mark                   Matches any mark, Mc | Me | Mn.
Non-spacing mark          :Mn     Nonspacing_Mark        Matches non-spacing marks.
Combining mark            :Mc     Spacing_Mark           Matches combining marks.
Enclosing mark            :Me     Enclosing_Mark         Matches enclosing marks.
Number                    :N      Number                 Matches any number, Nd | Nl | No.
Decimal digit             :Nd     Decimal_Number         Matches decimal digits such as 0-9 and their full-width equivalents.
Letter digit              :Nl     Letter_Number          Matches letter digits such as roman numerals and ideographic number zero.
Other digit               :No     Other_Number           Matches other digits such as old italic number one.
Punctuation               :P      Punctuation            Matches any punctuation, Pc | Pd | Pe | Pf | Pi | Po | Ps.
Connector punctuation     :Pc     Connector_Punctuation  Matches the underscore or underline mark.
Dash punctuation          :Pd     Dash_Punctuation       Matches the dash mark.
Open punctuation          :Ps     Open_Punctuation       Matches opening punctuation such as open brackets and braces.
Close punctuation         :Pe     Close_Punctuation      Matches closing punctuation such as closing brackets and braces.
Initial quote punctuation :Pi     Initial_Punctuation    Matches initial double quotation marks.
Final quote punctuation   :Pf     Final_Punctuation      Matches single quotation marks and ending double quotation marks.
Other punctuation         :Po     Other_Punctuation      Matches commas (,), ?, ", !, @, #, %, &, *, \, colons (:), semi-colons (;), ', and /.
Symbol                    :S      Symbol                 Matches any symbol, Sc | Sk | Sm | So.
Math symbol               :Sm     Math_Symbol            Matches +, =, ~, |, <, and >.
Currency symbol           :Sc     Currency_Symbol        Matches $ and other currency symbols.
Modifier symbol           :Sk     Modifier_Symbol        Matches modifier symbols such as circumflex accent, grave accent, and macron.
Other symbol              :So     Other_Symbol           Matches other symbols, such as the copyright sign, pilcrow sign, and the degree sign.
Separator                 :Z      Separator              Matches any separator, Zl | Zp | Zs.
Paragraph separator       :Zp     Paragraph_Separator    Matches the Unicode character U+2029.
Space separator           :Zs     Space_Separator        Matches blanks.
Line separator            :Zl     Line_Separator         Matches the Unicode character U+2028.
Other control             :Cc     Control                Matches end of line.
Other format              :Cf     Format                 Matches formatting control characters such as the bidirectional control characters.
Surrogate                 :Cs     Surrogate              Matches one half of a surrogate pair.
Other private-use         :Co     Private_Use            Matches any character from the private-use area.
Other not assigned        :Cn     Unassigned             Matches characters that do not map to a Unicode character.

References:

unicode.org

Unicode Character Properties

Unicode Regular Expressions

Unicode Property Aliases 

Perl Regular Expressions

PCRE


Using User Agent statistics to detect website conversion problems

If it’s not tested, it doesn’t work.

This has been my long-standing mantra. Any bit of software – be it a website or other program – if it hasn’t been tested against certain inputs, then chances are there’s a bug in there that has never been uncovered. Browser nuances and incompatibilities are notorious for making sites look ugly, behave badly or simply not work at all. So unless you have tested your website against all of the major browsers, you can’t be sure that there isn’t a problem with the site causing some of your users to just go away in disgust.

It is impossible to test your website against every web browser out there…

In a recent test, more than 30,000 different “user agent” (browser) identifiers were found in a sample of one month of data from one site – and that excludes search engine crawlers and other Internet robots. Most, in fact, are Internet Explorer (IE) variants of some sort, with different combinations of plugins. Here is a typical identifier:

  Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; InfoPath.1; .NET CLR 2.0.50727)

‘Mozilla/4.0’ identifies the browser’s basic capabilities. MSIE 6.0 is the browser version. Windows NT 5.1 is the operating system (though it is probably really Windows XP). SV1 is “Security Version 1” as introduced in Windows XP Service Pack 2. InfoPath is a Microsoft Office plugin. “.NET CLR …” is the version of the .NET Common Language Runtime that the browser supports.

This is just one example; other browsers identify themselves as Mozilla/5.0, or MSIE 7.0, or Windows NT 5.0, or with other plugins/toolbars, or with multiple versions of .NET CLR, and so on. And then there are the other browsers – Firefox, Safari, Opera, Konqueror, and many more. There is a combinatorial explosion – and it is impossible to test every single possibility.

How much testing do you need to do?

Here is the mix of browser types from three different types of sites:

  • Site A: a B2B site in the electronics area, likely to be dominated by technically savvy male users, looking for products for immediate purchase
  • Site B: a general homewares and gadgets type site, representing a broad cross-section of home users
  • Site C: a women’s fashion site

  Browser           Site A    Site B    Site C
  MSIE (Total)      75.208%   80.174%   76.683%
  Firefox (Total)   21.204%   13.834%   17.859%
  Safari (Total)    1.306%    4.913%    4.285%
  Opera (Total)     1.347%              0.512%
  Other             2.139%    2.366%    2.670%

  MSIE 6.0          49.760%   40.461%   40.088%
  MSIE 7.0          24.984%   39.374%   35.890%
  Firefox 2.0       19.613%   12.792%   16.472%
  Firefox 1.5       1.025%    0.607%    0.852%
  Firefox 1.0       0.540%              0.511%
  Safari 4xx        0.923%    3.894%    3.518%
  Safari 3xx                  0.507%
  Opera 9.2         1.016%

Clearly, testing against IE 6.0, IE 7.0 and Firefox 2.0 is mandatory, these being the dominant browsers in the market. But what about Safari? It could be up to 5% of your market that you are losing if your website does not work well with Safari. And what about all of those variants – could your site test satisfactorily for IE 6.0 but be broken if the InfoPath plugin is added to the browser? Even considering the most common browser/platform/option combinations results in a substantial number of variants, and every extra platform required for testing is an additional expense – both to maintain the test suite and to run the tests on each update. Is there another way?

A Statistical Approach

Perhaps, by looking at the distribution of user agents for all the visitors on a website, and then looking at the same for those visitors that convert, it would be possible to identify any browser types that are not converting at the expected rate. This might indicate a problem with the site for that particular browser type.

This of course assumes that users are equally likely to convert regardless of which type of browser they are using. As it turns out, this assumption is completely wrong.

The following table compares browser strings at the website point of entry with browser strings at the point of transaction for Site B, and shows that the difference is significant.

  Browser       Visitors   Transactions
  MSIE 7.0      39.374%    45.657%
  MSIE 6.0      40.461%    38.482%
  Firefox 2.0   12.792%    9.207%
  Safari 4xx    3.894%     3.726%

On Site B, users are generally spending their own money; one might speculate that Firefox users represent a younger (poorer) group of users and therefore convert at a lower rate, and perhaps IE 7.0 users are either cashed up and have just bought Vista, or are people who feel more comfortable transacting on the Internet because they know their computer is running the latest up-to-date software. The statistics for Site C show Firefox users converting at half the rate of other users; in contrast, for Site A (the B2B site where users are probably spending their employer’s money, and where browsers are governed by company IT policy) the conversion rates are fairly constant across all browser types.

However, the basic idea of detecting problems using statistics is sound, providing two points of comparison can be found where the assumption holds that the distribution of browsers is the same. For example, any two points within the https conversion funnel, or any two product pages, or a product page and a category page.  Or, you could take one page and compare it against the statistics for all pages in a similar section of the site.

The analysis proceeds as follows:

  1. Take your two data sets – the HTTP request logs comprising at least the user agent strings – one data set for each URL to be compared, and calculate the frequency statistics for the user agents in each data set. Perl is an ideal tool for this analysis (a sketch follows this list). You will probably need about a month of data to get reliable results.
  2. Where the frequency in both data sets for the same user agent is less than about 10, add this to a user agent called “Other” and throw away the original data. Where the frequency falls below about 10, the assumptions of the chi-square test start to fail, and since this is a distribution with a very long tail, the tail errors can end up dominating the result.
  3. Perform a chi-square test to see if the distributions are different. If they are, keep going to find out where they are different.
  4. Place the resulting data in a spreadsheet – frequencies of each sample, expected values, maximum errors between expected and actual frequencies (in absolute and/or relative terms). Discard any samples where the expected values are too low to be worth considering.
  5. Sort the result by error descending. The browsers appearing at the top are causing the most trouble.
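
Here is a sketch of steps 1 to 3 in Perl. It assumes two input files, each containing one user agent string per line (the file names are hypothetical; extracting the strings from your raw logs is left to you), and uses the CPAN module Statistics::Distributions for the p-value:

  use strict;
  use warnings;
  use Statistics::Distributions qw(chisqrprob);

  # Step 1: frequency statistics for one data set.
  sub ua_counts {
      my ($file) = @_;
      my %count;
      open my $fh, '<', $file or die "open $file: $!";
      while (my $ua = <$fh>) {
          chomp $ua;
          $count{$ua}++;
      }
      return \%count;
  }

  my $entry = ua_counts('entry_page.log');
  my $trans = ua_counts('transaction_page.log');

  # Step 2: fold rare user agents into "Other" so the chi-square
  # assumptions hold for this very long-tailed distribution.
  my %fold;
  for my $ua (keys %{ { %$entry, %$trans } }) {
      my ($ca, $cb) = ($entry->{$ua} || 0, $trans->{$ua} || 0);
      my $key = ($ca < 10 && $cb < 10) ? 'Other' : $ua;
      $fold{$key}[0] += $ca;
      $fold{$key}[1] += $cb;
  }

  # Step 3: chi-square test of homogeneity between the two samples.
  my ($na, $nb) = (0, 0);
  for my $r (values %fold) { $na += $r->[0]; $nb += $r->[1]; }

  my $chisq = 0;
  for my $r (values %fold) {
      my $row = $r->[0] + $r->[1];
      my $ea  = $row * $na / ($na + $nb);   # expected count, sample A
      my $eb  = $row * $nb / ($na + $nb);   # expected count, sample B
      $chisq += ($r->[0] - $ea)**2 / $ea + ($r->[1] - $eb)**2 / $eb;
  }
  my $df = keys(%fold) - 1;
  printf "chi-square %.2f, df %d, p %.4g\n",
      $chisq, $df, chisqrprob($df, $chisq);

A small p-value says the two distributions differ; the per-browser errors computed in step 4 then point to where.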

In Action

This approach was applied to a tracking code implementation. A random sample of users received a cookie. The presence of the cookie was then tested on a particular page. The expectation was that the distribution of browsers should be the same with or without the cookie. The results, however, showed that the distributions were different, and the final analysis pointed to the Safari browser running on Mac as the problem. Tests against other sites using the same tracking code confirmed the result. Note, though, that the site wasn’t completely broken on Safari, as some users had managed to acquire the cookie. This is the type of problem that testing directly with a browser may or may not have picked up – depending on what settings the tester’s browser happened to be using.

Summary

Test, validate, test, validate, then validate some more. It is absolutely necessary to test your website with the more popular browsers, but not always sufficient. User agent statistics provide an additional test, potentially a stronger test and definitely a cheaper test compared to manual testing, for browser-specific problems.

References:

http://blogs.msdn.com/ie/archive/2004/09/02/224902.aspx

