MySQL Multi-Select Performance - The Sequel

Following my original post, it was suggested to me that one of the following may give better performance:

  • SELECT … UNION SELECT …
  • Using a temporary table with an index.

Well, not so.  I have added the above cases to my benchmarking script, and updated the graph as shown below.

SELECT … UNION gave all sorts of problems.  Firstly, it broke at a query set size of 1000 with the error

Can't open file: './bench/test1.frm' (errno: 24)

After a bit of searching I found that the remedy for this was to increase the MySQL open_files_limit setting (was 1024, increased to 8192).  This got it going again, only to fall over once more at a query set size of 10000, this time with the error

parser stack overflow near 'UNION SELECT ...

to which I could not find a solution.  In any case, the performance as shown in the graph is closely tracking the exponential degradation of the SELECT + OR case.  Conclusion: SELECT UNIONs are not suited for a large number of unions.  Useful when merging the results of several different SELECT statements, though.

The addition of an index to the temporary table also had no appreciable effect in this test, probably because MySQL will use the index in the main table to search while scanning through the temporary table.  Perhaps there might be an improvement for the case where the temporary table is larger than the main table - but that would imply duplicates in the temporary table.

Comments (1)

Delicious Bookmark this on Delicious submit to reddit

MySQL - Many-row SELECT Performance - “OR” bad, “IN” good

Consider the situation where you have a list of row IDs and you need to retrieve the data for each of the rows.  The simplest way is to make one query per row, i.e.

(A) SELECT * from data_table WHERE id=?

For a large number of rows, that results in a lot of queries.  This could be condensed into one query, such as:

(B) SELECT * from data_table WHERE id=1 OR id=2 OR id=3 …

or

(C) SELECT * from data_table WHERE id IN (1,2,3,…)

When constructing potentially large SQL statements such as these (imagine if you wanted to retrieve 1,000,000 rows), it’s important to take into account the max_allowed_packet size which restricts the length of the query.  It might be necessary to divide the data up into several blocks and make a query for each block to ensure max_allowed_packet is not exceeded.

Another approach is to create a temporary table, insert the keys of the required rows, then do a JOIN query to retrieve the data, i.e.

(D) CREATE TEMPORARY TABLE tmp ( id INT(11) );

INSERT INTO tmp (id) VALUES (1), (2), (3), …

SELECT d.* FROM data_table d JOIN tmp USING (id)

This approach is somewhat cleaner, particularly when multiple keys are involved.  With multiple keys the WHERE syntax of the prior options becomes:

WHERE (key1=x1 AND key2=y1) OR (key1=x2 AND key2=y2) …

or

WHERE (key1, key2) IN ((x1, y1), (x2, y2), …)

Under the temporary table approach, the question then arises as to how to most efficiently insert the data. A ‘LOAD DATA INFILE’ approach is the most efficient way to load a table, but here we assume this is not an option as it is not readily portable (due to security settings that differ between local and remote MySQL daemons).  The example (D) above assumes a long INSERT statement, which again may be affected by max_allowed_packet.  Other options include:

(E) Multiple single INSERTs, INSERT INTO tmp (id) VALUE (?)

(F) Multiple single INSERTs in a transaction block, begin_work .. commit

(G) Multiple single INSERTs as an array, using the DBI execute_array() function

(H) As for (G), in a transaction block.

These options were benchmarked using MySQL 5.0.45 and the results are shown in the figure below.  As would be expected, the use of single select statements scales linearly.  For small query set sizes, the setup times for the different query approaches have significant impact on the performance; as the query set size increases, three classes emerge - one group that performs similarly to single selects, another that performs much much better, and one that lives on a completely different planet (one you wouldn’t want to visit).  In summary:

  • That SELECT + IN(…) (case C) offers best performance when the query set size is above 30 or so.  It is also interesting to note that the performance of SELECT + IN(…) is very similar to using a temporary table with a single, long INSERT statement for large query set sizes, presumably because internally the IN(…) operation is essentially implemented as a temporary table.
  • That SELECT + OR (case B) is a good choice for query set size < 30
  • That SELECT + OR hits a point where performance becomes exponentially worse (not shown on the graph, for the largest data set the performance reaches 1300s per query set!  Curiously, this is elapsed time, but CPU time does not significantly increase. This suggest there are some inefficient data moves/swapping occurring).

In short, as a rule of thumb, use SELECT + OR for query sets < 30 in size, and SELECT + IN(…) otherwise.

The SELECT + OR performance is a significant result; the Perl SQL::Abstract library turns a WHERE specification such as { A => [ 1, 2, 3] } into  WHERE ( ( ( A = ? ) OR ( A = ? ) OR ( A = ? ) ) ).  It will do the same if there are 1000 options (try it - perl -MSQL::Abstract -e ‘$sql = SQL::Abstract->new; $w = $sql->where({ A => [ 1 .. 1000]}); print $w’).  Thus libraries that use SQL::Abstract, such as DBIx::Class, are similarly affected.  A perfectly reasonable approach from the library’s perspective, but potentially a significant performance hit if used in this manner.

Feel free to review my benchmarking code and tell me if I’ve got it wrong…

UPDATE Nov 19 2008:  There is a sequel post that looks at SELECT … UNION and using a temporary table with an index.

Comments (1)

Delicious Bookmark this on Delicious submit to reddit

Integrating Sphinx into Perl Applications

Sphinx is a full-text search engine (http://www.sphinxsearch.com) designed
primarily for full-text search of database content.  It has many features but in
my opinion its best assets are speed of search and scalability.

We started using Sphinx when MySQL built-in full-text search was becoming too
slow and too CPU intensive, and of questionable accuracy.  Sphinx is lightning
fast compared to MySQL and provides better results relevancy.

This note is about integration with the standalone Sphinx search server. Sphinx
also has a component (’SphinxSE’) that runs as a MySQL 5 engine so can be used as
a direct replacement for MySQL full-text search; to use SphinxSE, standard Perl
DBI should be all that is necessary.

What you will need:

The following CPAN modules are likely to be useful:

Sphinx::Search
Sphinx::Manager
Sphinx::Config

Sphinx::Manager provides facilities to start and stop the search server and to
run the indexer.

Sphinx::Search provides the search API.

Sphinx::Config allows you to read/write the Sphinx configuration files from
code, in case you wish to maintain the configuration elsewhere (e.g. in your
database).

Putting it all together:

Running the Sphinx searchd server

Sphinx operates most efficiently if it is allowed to run persistently as a
background service.  Theoretically, you could start the Sphinx server, do a
search and then stop it on every request, with a small amount of overhead - but
here we will consider just the typical case.

Ideally you will use your operating system tools start such as daemontools,
monit or just the SysV startup scripts to start and monitor searchd, rather than
have to worry about it in your perl app.  But, if you need or want to start it
in perl:

  use Sphinx::Manager;
  my $mgr = Sphinx::Manager->new({ config_file => ’/etc/sphinx.conf’ });
  $mgr->start_searchd;

You should verify that the effective UID of your perl app has all of the appropriate
permissions:

  • to create and write to the PID file (see ’searchd’ section of config, ‘pid_file’)
  • to create and write to the log file (see ’searchd’/'log’)
  • to read the Sphinx database files (’path’ in each of your ‘index’ specifications)

Adding Content to the Index

  use Sphinx::Manager;
  my $mgr = Sphinx::Manager->new({ config_file => ’/etc/sphinx.conf’ });
  $mgr->run_indexer('--rotate');

Sphinx gets its content for indexing directly from the database, according to
the ’sql_query’ given in the config file.  ‘run_indexer’ simply runs the command
line version of the Sphinx indexer program.  You can pass any indexer arguments
through to ‘run_indexer’; ‘–rotate’ is typical, to force searchd to start using
the newly created index without disrupting searches while indexing is
occurring.

Searching

Make sure you have a version of Sphinx::Search that is compatible with searchd.
A compatibility list is given at the top of the Sphinx::Search perldoc.
Hopefully a point will be reached where the Sphinx::Search client can support a
range of searchd versions, but for the moment that is impractical.

Sphinx::Search can be used with any logging object that supports error, warn,
info and debug methods.  In this example I have used Log::Log4perl.

  use Sphinx::Search;
  use Log::Log4perl qw(:easy);
  Log::Log4perl->easy_init($DEBUG);
  $sph = Sphinx::Search->new( log => Log::Log4perl->get_logger('sphinx.search') );
  my $results = $sph->setMatchMode(SPH_MATCH_ALL)
                    ->Query("...");

Configuring

Sphinx::Config provides the tools to read and write the Sphinx configuration file.

A typical problem is that searchd is running on a non-standard port (the default
is 3312), so how will your perl app know where to find it?  Obviously you don’t
want to hard-code port numbers in case they change…

use Sphinx::Search;
use Sphinx::Config;
use Log::Log4perl qw(:easy);

Log::Log4perl->easy_init($DEBUG);

$sph = Sphinx::Search->new( log => Log::Log4perl->get_logger(’sphinx.search’) );

# Get port from config file
$conf = Sphinx::Config->new;
$conf->parse(’/etc/sphinx.conf’);
my $port = $conf->get(’searchd’, undef, ‘port’);

# Tell Sphinx client
$sph->setServer(’localhost’, $port);

my $results = $sph->Query(”…”);


Enjoy

We have had a considerable amount of success using Perl and Sphinx.  I hope you
do too.


Comments

Delicious Bookmark this on Delicious submit to reddit

Adding Action Timings to your Catalyst Output

About a year ago, onemogin wrote an article on adding action timings to the HTML output of a Catalyst app. To do so, it was necessary to access $c->stats, which at the time was an internal object (that is, there was no published API for it) and therefore subject to change. As of Catalyst-Runtime 5.7012, $c->stats has a defined interface and returns a Catalyst::Stats object (or your own class, if you provide one) rather than the Tree::Simple object that it used to.

It’s easy to fix your code to work with 5.7012. Onemogin’s code in the end() method looked like this:

  my $tree = $c->stats();

  my $dvisit = new Tree::Simple::Visitor();
  $tree->accept($dvisit);
  $c->stash->{'action_stats'} = $dvisit->getResults();

which needs to become this:

  my @report = $c->stats->report;
  $c->stash->{action_stats}= \@report;

and your template will also need to change; here’s an example:

 
 <div id="stats">
 <table border="0" cellspacing="0" cellpadding="0">
 [% space = '&nbsp;&nbsp;' %]
 <tr><th>Action</th><th>Time</th></tr>
 [% FOREACH r=action_stats %]
 <tr><td class="description">[% space.repeat(r.0) %][% r.1 | html %]</td>
<td class="elapsed">[% UNLESS r.3 %]+[% END %][% r.2 %]s</td></tr>
 [% END %]
 </table>
 </div>

to produce an end result such as:

ActionTime
/default0.005895s
  -> /look_left0.00091s
    - starting critical bit+0.000479s
    - critical bit complete+0.000208s
  -> /look_right0.000587s
  -> /look_left0.000799s
    - starting critical bit+0.000441s
    - critical bit complete+0.000169s
  -> /cross_over0.001766s
/end0.000462s

Here’s the bit of controller code that generated the example:

sub default : Private {
    my ( $self, $c ) = @_;

    $c->forward('look_left');
    $c->forward('look_right');
    $c->forward('look_left');
    $c->forward('cross_over');
}

sub look_left : Private {
    my ( $self, $c ) = @_;
    for (1 .. 100) {};
    $c->stats->profile("starting critical bit");
    for (1 .. 100) {};
    $c->stats->profile("critical bit complete");
}

sub look_right : Private {
    for (1 .. 1000) {};
}
sub cross_over : Private {
    for (1 .. 10000) {};
}

sub end : ActionClass('RenderView') {
    my ( $self, $c ) = @_;
    my @report = $c->stats->report;
    $c->stash->{action_stats}= \@report;
}

Comments (1)

Delicious Bookmark this on Delicious submit to reddit

MSIE Cookies Bite Back!

Here we are in 2008. We build computers with RAM measured in GB and disk in TB. I just discovered (the hard way) that Microsoft Internet Explorer can only handle 4096 bytes of cookies for a page in JavaScript. Total. Not each. Total.

Worse, if the cookies on your page exceed this limit and you try to read the cookies using document.cookie, you don’t just get some of the cookies or a set that is truncated to 4096 bytes; you get NOTHING.

From the Microsoft Knowledge Base: “For one domain name, each cookie is limited to 4,096 bytes. This total can exist as one name-value pair of 4 kilobytes (KB) or as up to 20 name-value pairs that total 4 KB. … If you use the document.cookie property to retrieve the cookie on the client side, the document.cookie property can retrieve only 4,096 bytes. This byte total can be one name-value pair of 4 KB, or it can be up to 20 name-value pairs that have a total size of 4 KB.”

Stack that up against RFC 2965, which says:

   ...general-use

   user agents SHOULD provide each of the following minimum capabilities

   individually, although not necessarily simultaneously:      *  at least 300 cookies

*  at least 4096 bytes per cookie (as measured by the characters

         that comprise the cookie non-terminal in the syntax description

         of the Set-Cookie2 header, and as received in the Set-Cookie2

         header)

*  at least 20 cookies per unique host or domain name

User agents created for specific purposes or for limited-capacity

   devices SHOULD provide at least 20 cookies of 4096 bytes, to ensure

   that the user can interact with a session-based origin server.

According to the references, this problem applies up to MSIE 6.0, but testing shows it is still a problem in IE 7.

Needless to say, this is only a problem in IE.  Firefox and Safari, although they presumably have some limit, do not suffer the same ridiculously small bound.

Test it yourself; here is a simple cookie limit test page containing a script that sets 10 cookies, each of about 72 bytes, printing document.cookies at each iteration. On first visit, the cookies disappear at iteration 6, and on subsequent visits at iteration 1 (until you clear cookies or close your browser).

I wonder how many shopping carts this has broken.

References:

Comments

Delicious Bookmark this on Delicious submit to reddit

Page Load Times and Visitor Abandonment

This is some 2005 data, courtesy http://www.marketingexperiments.com/improving-website-conversion/page-weight.html and http://www.emarketer.com/

Visitor Abandonment
Page Load Time Percent of Users
Continuing to Wait
10 seconds 84%
15 seconds 51%
20 seconds 26%
30 seconds 5%

Check box What You Need To UNDERSTAND: You will lose nearly half your visitors if they have to wait longer than 15 seconds for a page to load. Only 5% of visitors will wait longer than 30 seconds.

In 2008, where most of the population is on broadband, I expect that visitors are less patient than ever.

Update:  2006 data, quoted from http://www.avactis.com/forums/index.php?showtopic=238 : “The research shows that four seconds is the maximum length of time an average online shopper will wait for a Web page to load before abandoning one retail site and moving on to another.”

 

Comments

Delicious Bookmark this on Delicious submit to reddit

MP3 to WAV Conversion on Linux

MP3 to WAV conversion is remarkably simple:

mpg123 -w out.wav in.mp3

For the purpose of writing an audio CD, a sample rate of 44100 Hz and stereo output are essential:

mpg123 --stereo -r 44100 -w out.wav in.mp3

And then to write the WAV files to a CD:

cdrecord -audio -pad *.wav

Comments

Delicious Bookmark this on Delicious submit to reddit

IE Cookie Handling Policies

This is a summary of Internet Explorer settings for handling cookies, under the so-called “Privacy” options; IE6 and IE7 are the same, although some of the wording has changed in the descriptions. It’s important to keep these in mind when issuing cookies. The Wikipedia article on HTTP Cookies outlines some of the alternatives.

  • Block All Cookies
    Blocks all cookies from all web sites from being accepted, and won’t send any existing cookies.Should be renamed “Unusable”.
  • High
    Blocks all cookies from websites that do not carry a compact privacy policy (P3P) and cookies that contain personally identifiable (contact) information. A November 2007 study shows that only about 4% of sites use P3P, so this security setting is almost as unusable as “Block All Cookies”.
  • Medium High
    Same as “High” for 3rd party cookies. Also blocks first party cookies that contain personally identifiable information.
  • Medium
    As above, but rather than “blocking” first party cookies that contain personally identifiable information, it only “restricts” them. Efforts to find the difference between “block” and “restrict” have so far been fruitless. It may mean that cookies are accepted but not sent (how useful would that be?), or that such cookies can only be used in the same web page that created them (i.e. a restriction on the domain/path components of the cookie), of that the cookies are not kept beyond the current session.
  • Low
    Same as “High” for 3rd party cookies. No restrictions on first party cookies.
  • Accept All Cookies
    No restrictions.

In contrast, Safari (MacIntosh) allows the simple options: Accept Cookies Always/Never/Only sites you navigate to. (i.e. Always/Never/Only First Party). Firefox by default allows all cookies except where specific exceptions have been defined. There do not seem to be any Firefox extensions which emulate the IE or Safari behaviour - which perhaps places into perspective the real threat that third party cookies are(n’t) in general.

References

P3P Usage Survey

Comments

Delicious Bookmark this on Delicious submit to reddit

Sphinx::Search 0.08 released to CPAN

I have just uploaded to CPAN the latest version of Sphinx::Search, the Perl API for the Sphinx Search Engine.

Search for Sphinx::Search on CPAN to get the latest.

Version 0.08 is suitable for Sphinx 0.9.8-svn-r871 and later (currently r909). This version fixes a couple of bugs related to error checking.

I have been asked a few times what makes Sphinx::Search different from the Perl API that comes bundled in the contrib directory of the Sphinx distribution. The bundled Sphinx.pm was used as the starting point of Sphinx::Search. Maintenance of that version appears to have lapsed at sphinx-0.9.7, so many of the newer API calls are not available there. Sphinx::Search is mostly compatible with the old Sphinx.pm except:

  • On failure, Sphinx::Search returns undef rather than 0 or -1.
  • Sphinx::Search ’Set’ functions are cascadable, e.g. you can do
    Sphinx::Search->new ->SetMatchMode(SPH_MATCH_ALL) ->SetSortMode(SPH_SORT_RELEVANCE) ->Query("search terms")
  • Sphinx::Search also provides documentation and unit tests, which were the main motivations for branching from the earlier work.

Sphinx has proven to be a very efficient and better quality search engine than the built-in MySQL full text search. It is an order of magnitude faster for large data sets and provides better options for controlling search result relevancy.

Comments (1)

Delicious Bookmark this on Delicious submit to reddit

Google Searches per Day

As of August 2007, Google is handling 1200 Million searches per day on average worldwide, according to a Clickz article reporting on Comscore data. Yahoo is a long way behind at 275 Million, and MSN at 70 Million. Baidu (a Chinese search engine) beats MSN, coming in at 105 Million.

2006 figures for the US only put Google at 91 Million searches per day.

In June 2007, Wikipedia received an average of 55.6 Million referrals per day from Google. Guessing that each Google search results in 2 click-throughs on average, that means Wikipedia is getting about 2% of Google’s organic traffic. Wikipedia is ranked #8 in Alexa’s traffic rankings, behind Yahoo, Google, YouTube, Live & MSN, Myspace and Facebook. Given that these probably get most of their traffic directly rather than via search engines, Wikipedia may be the most-referred site from Google.

Alexa’s top 50 is dominated by search engines and various forms of social networking sites, with Ebay, Amazon and a few porn sites thrown in.

Google lists its most popular search terms on a daily basis in its Hot Trends area. Leading up to Thanksgiving in the US, about 10% of the top 100 search terms contained the word “turkey” - turkey cooking time, turkey recipe, how long to cook a turkey, roasting turkey, turkey thermometer, turkey soup, and more. About 40% of the most popular queries related to Thanksgiving in some way.

References:

Clickz article: Worldwide Internet: Now Serving 61 Billion Searches per Month

Alexa Top 500
Wikimedia page views

Search Engine Watch 2006 data

Comments

Delicious Bookmark this on Delicious submit to reddit

« Previous entries