Archive for MySQL

MySQL Multi-Select Performance – The Sequel

Following my original post, it was suggested to me that one of the following may give better performance:

  • SELECT … UNION SELECT …
  • Using a temporary table with an index.

Well, not so.  I have added the above cases to my benchmarking script, and updated the graph as shown below.

SELECT … UNION gave all sorts of problems.  Firstly, it broke at a query set size of 1000 with the error

Can't open file: './bench/test1.frm' (errno: 24)

After a bit of searching I found that the remedy for this was to increase the MySQL open_files_limit setting (was 1024, increased to 8192).  This got it going again, only to fall over once more at a query set size of 10000, this time with the error

parser stack overflow near 'UNION SELECT ...

to which I could not find a solution.  In any case, the performance as shown in the graph is closely tracking the exponential degradation of the SELECT + OR case.  Conclusion: SELECT UNIONs are not suited for a large number of unions.  Useful when merging the results of several different SELECT statements, though.

The addition of an index to the temporary table also had no appreciable effect in this test, probably because MySQL will use the index in the main table to search while scanning through the temporary table.  Perhaps there might be an improvement for the case where the temporary table is larger than the main table – but that would imply duplicates in the temporary table.

Comments (1)

MySQL – Many-row SELECT Performance – “OR” bad, “IN” good

Consider the situation where you have a list of row IDs and you need to retrieve the data for each of the rows.  The simplest way is to make one query per row, i.e.

(A) SELECT * from data_table WHERE id=?

For a large number of rows, that results in a lot of queries.  This could be condensed into one query, such as:

(B) SELECT * from data_table WHERE id=1 OR id=2 OR id=3 …

or

(C) SELECT * from data_table WHERE id IN (1,2,3,…)

When constructing potentially large SQL statements such as these (imagine if you wanted to retrieve 1,000,000 rows), it’s important to take into account the max_allowed_packet size which restricts the length of the query.  It might be necessary to divide the data up into several blocks and make a query for each block to ensure max_allowed_packet is not exceeded.

Another approach is to create a temporary table, insert the keys of the required rows, then do a JOIN query to retrieve the data, i.e.

(D) CREATE TEMPORARY TABLE tmp ( id INT(11) );

INSERT INTO tmp (id) VALUES (1), (2), (3), …

SELECT d.* FROM data_table d JOIN tmp USING (id)

This approach is somewhat cleaner, particularly when multiple keys are involved.  With multiple keys the WHERE syntax of the prior options becomes:

WHERE (key1=x1 AND key2=y1) OR (key1=x2 AND key2=y2) …

or

WHERE (key1, key2) IN ((x1, y1), (x2, y2), …)

Under the temporary table approach, the question then arises as to how to most efficiently insert the data. A ‘LOAD DATA INFILE’ approach is the most efficient way to load a table, but here we assume this is not an option as it is not readily portable (due to security settings that differ between local and remote MySQL daemons).  The example (D) above assumes a long INSERT statement, which again may be affected by max_allowed_packet.  Other options include:

(E) Multiple single INSERTs, INSERT INTO tmp (id) VALUE (?)

(F) Multiple single INSERTs in a transaction block, begin_work .. commit

(G) Multiple single INSERTs as an array, using the DBI execute_array() function

(H) As for (G), in a transaction block.

These options were benchmarked using MySQL 5.0.45 and the results are shown in the figure below.  As would be expected, the use of single select statements scales linearly.  For small query set sizes, the setup times for the different query approaches have significant impact on the performance; as the query set size increases, three classes emerge – one group that performs similarly to single selects, another that performs much much better, and one that lives on a completely different planet (one you wouldn’t want to visit).  In summary:

  • That SELECT + IN(…) (case C) offers best performance when the query set size is above 30 or so.  It is also interesting to note that the performance of SELECT + IN(…) is very similar to using a temporary table with a single, long INSERT statement for large query set sizes, presumably because internally the IN(…) operation is essentially implemented as a temporary table.
  • That SELECT + OR (case B) is a good choice for query set size < 30
  • That SELECT + OR hits a point where performance becomes exponentially worse (not shown on the graph, for the largest data set the performance reaches 1300s per query set!  Curiously, this is elapsed time, but CPU time does not significantly increase. This suggest there are some inefficient data moves/swapping occurring).

In short, as a rule of thumb, use SELECT + OR for query sets < 30 in size, and SELECT + IN(…) otherwise.

The SELECT + OR performance is a significant result; the Perl SQL::Abstract library turns a WHERE specification such as { A => [ 1, 2, 3] } into  WHERE ( ( ( A = ? ) OR ( A = ? ) OR ( A = ? ) ) ).  It will do the same if there are 1000 options (try it – perl -MSQL::Abstract -e ‘$sql = SQL::Abstract->new; $w = $sql->where({ A => [ 1 .. 1000]}); print $w’).  Thus libraries that use SQL::Abstract, such as DBIx::Class, are similarly affected.  A perfectly reasonable approach from the library’s perspective, but potentially a significant performance hit if used in this manner.

Feel free to review my benchmarking code and tell me if I’ve got it wrong…

UPDATE Nov 19 2008:  There is a sequel post that looks at SELECT … UNION and using a temporary table with an index.

Comments (1)