The MySQL query optimizer has different strategies available
to evaluate subqueries. For IN (or
=ANY) subqueries, the optimizer has these
choices:
Semi-join
Materialization
EXISTS strategy
For NOT IN (or
<>ALL) subqueries, the optimizer has
these choices:
Materialization
EXISTS strategy
For derived tables (subqueries in the FROM
clause) and view references, the optimizer has these choices:
Merge the derived table or view into the outer query block
Materialize the derived table or view to an internal temporary table
The following discussion provides more information about these optimization strategies.
A limitation on UPDATE and
DELETE statements that use a
subquery to modify a single table is that the optimizer does
not use semi-join or materialization subquery optimizations.
As a workaround, try rewriting them as multiple-table
UPDATE and
DELETE statements that use a
join rather than a subquery.
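The workaround can be sketched as follows. The tables orders and customers and their columns (customer_id, id, status, archived) are hypothetical names chosen for illustration; they do not come from this section.

```sql
-- Hypothetical schema: orders(customer_id, archived, ...),
-- customers(id, status, ...).

-- Single-table DELETE with a subquery: semi-join and materialization
-- are not applied to this form.
DELETE FROM orders
WHERE customer_id IN (SELECT id FROM customers WHERE status = 'closed');

-- Multiple-table DELETE that expresses the same condition as a join.
DELETE orders
FROM orders
  INNER JOIN customers ON orders.customer_id = customers.id
WHERE customers.status = 'closed';

-- The same idea applied to UPDATE.
UPDATE orders
  INNER JOIN customers ON orders.customer_id = customers.id
SET orders.archived = 1
WHERE customers.status = 'closed';
```

Before running either form against real data, it is prudent to run a SELECT with the same join condition and confirm that it identifies the rows you intend to change.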
As of MySQL 5.6.5, the optimizer uses semi-join strategies to improve subquery execution, as described in this section.
For an inner join between two tables, the join returns a row
from one table as many times as there are matches in the
other table. But for some questions, the only information
that matters is whether there is a match, not the number of
matches. Suppose that there are tables named
class and roster that
list classes in a course curriculum and class rosters
(students enrolled in each class), respectively. To list the
classes that actually have students enrolled, you could use
this join:
SELECT class.class_num, class.class_name FROM class INNER JOIN roster WHERE class.class_num = roster.class_num;
However, the result lists each class once for each enrolled student. For the question being asked, this is unnecessary duplication of information.
Assuming that class_num is a primary key
in the class table, duplicate suppression
could be achieved by using
SELECT
DISTINCT, but it is inefficient to generate all
matching rows first only to eliminate duplicates later.
The same duplicate-free result can be obtained by using a subquery:
SELECT class_num, class_name FROM class WHERE class_num IN (SELECT class_num FROM roster);
Here, the optimizer can recognize that the
IN clause requires the subquery to return
only one instance of each class number from the
roster table. In this case, the query can
be executed as a
semi-join—that
is, an operation that returns only one instance of each row
in class that is matched by rows in
roster.
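The difference between the join and the subquery can be seen with a small data set. The column definitions and sample rows below are assumptions made for illustration; only the table names and class_num and class_name come from the preceding discussion.

```sql
-- Assumed minimal definitions; real tables would have more columns.
CREATE TABLE class (
  class_num  INT NOT NULL PRIMARY KEY,
  class_name VARCHAR(40) NOT NULL
);
CREATE TABLE roster (
  class_num  INT NOT NULL,
  student_id INT NOT NULL,
  PRIMARY KEY (class_num, student_id)
);

-- Class 101 has two students, 102 has one, 103 has none.
INSERT INTO class VALUES (101, 'Algebra'), (102, 'Biology'), (103, 'Chemistry');
INSERT INTO roster VALUES (101, 1), (101, 2), (102, 1);

-- Inner join: three rows (Algebra twice, Biology once).
SELECT class.class_num, class.class_name
FROM class INNER JOIN roster
WHERE class.class_num = roster.class_num;

-- IN subquery: two rows (Algebra and Biology, once each).
SELECT class_num, class_name
FROM class
WHERE class_num IN (SELECT class_num FROM roster);
```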
Before MySQL 5.6.6, the outer query specification was limited to simple table scans or inner joins using comma syntax, and view references were not possible. As of 5.6.6, outer join and inner join syntax is permitted in the outer query specification, and the restriction that table references must be base tables has been lifted.
In MySQL, a subquery must satisfy these criteria to be handled as a semi-join:
It must be an IN (or
=ANY) subquery that appears at the
top level of the WHERE or
ON clause, possibly as a term in an
AND expression. For example:
SELECT ... FROM ot1, ... WHERE (oe1, ...) IN (SELECT ie1, ... FROM it1, ... WHERE ...);
Here, ot_i
and it_i
represent tables in the outer and inner parts of the
query, and
oe_i and
ie_i
represent expressions that refer to columns in the outer
and inner tables.
It must not contain a GROUP BY or
HAVING clause.
It must not be implicitly grouped (it must contain no aggregate functions).
It must not have ORDER BY with
LIMIT.
The number of outer and inner tables together must be less than the maximum number of tables permitted in a join.
The subquery may be correlated or uncorrelated.
DISTINCT is permitted, as is
LIMIT unless ORDER BY
is also used.
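To make the criteria concrete, the following sketch applies them to the class and roster tables used earlier. The filter values are arbitrary, and whether the optimizer actually selects a semi-join strategy for the first query still depends on its cost estimates.

```sql
-- Meets the criteria: an IN subquery at the top level of WHERE,
-- as one term of an AND expression, with no grouping or aggregates.
SELECT class_num, class_name
FROM class
WHERE class_name LIKE 'A%'
  AND class_num IN (SELECT class_num FROM roster);

-- Does not meet the criteria: the subquery contains GROUP BY and HAVING.
SELECT class_num, class_name
FROM class
WHERE class_num IN (SELECT class_num FROM roster
                    GROUP BY class_num
                    HAVING COUNT(*) > 10);

-- Does not meet the criteria: NOT IN is not an IN (or =ANY) subquery.
SELECT class_num, class_name
FROM class
WHERE class_num NOT IN (SELECT class_num FROM roster);
```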
If a subquery meets the preceding criteria, MySQL converts it to a semi-join and makes a cost-based choice from these strategies:
Convert the subquery to a join, or use table pullout and run the query as an inner join between subquery tables and outer tables. Table pullout pulls a table out from the subquery to the outer query.
Duplicate Weedout: Run the semi-join as if it was a join and remove duplicate records using a temporary table.
FirstMatch: When scanning the inner tables for row combinations and there are multiple instances of a given value group, choose one rather than returning them all. This "shortcuts" scanning and eliminates production of unnecessary rows.
LooseScan: Scan a subquery table using an index that enables a single value to be chosen from each subquery's value group.
Materialize the subquery into a temporary table with an index and use the temporary table to perform a join. The index is used to remove duplicates. The index might also be used later for lookups when joining the temporary table with the outer tables; if not, the table is scanned.
Each of these strategies except Duplicate Weedout can be
enabled or disabled using the
optimizer_switch system
variable. The semijoin flag controls
whether semi-joins are used. If it is set to
on, the firstmatch,
loosescan, and
materialization flags enable finer
control over the permitted semi-join strategies. These flags
are on by default. See
Section 8.9.2, “Controlling Switchable Optimizations”.
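A brief sketch of how these flags might be adjusted for the current session. The particular combination of flags is only an example, not a recommendation.

```sql
-- Show the current flag settings.
SELECT @@optimizer_switch;

-- Keep semi-joins enabled but disallow the FirstMatch and
-- LooseScan strategies for this session.
SET SESSION optimizer_switch = 'semijoin=on,firstmatch=off,loosescan=off';

-- Disable semi-join conversion entirely for this session.
SET SESSION optimizer_switch = 'semijoin=off';

-- Return the flags touched above to their default values.
SET SESSION optimizer_switch =
  'semijoin=default,firstmatch=default,loosescan=default';
```

Changing the flags at session scope keeps the experiment from affecting other connections.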
The use of semi-join strategies is indicated in
EXPLAIN output as follows:
Semi-joined tables show up in the outer select.
EXPLAIN EXTENDED plus
SHOW WARNINGS shows the
rewritten query, which displays the semi-join structure.
From this you can get an idea about which tables were
pulled out of the semi-join. If a subquery was converted
to a semi-join, you will see that the subquery predicate
is gone and its tables and WHERE
clause were merged into the outer query join list and
WHERE clause.
Temporary table use for Duplicate Weedout is indicated
by Start temporary and End
temporary in the Extra
column. Tables that were not pulled out and are in the
range of EXPLAIN output
rows covered by Start temporary and
End temporary will have their
rowid in the temporary table.
FirstMatch(tbl_name)
in the Extra column indicates join
shortcutting.
LooseScan(m..n)
in the Extra column indicates use of
the LooseScan strategy. m and
n are key part numbers.
As of MySQL 5.6.7, temporary table use for
materialization is indicated by rows with a
select_type value of
MATERIALIZED and rows with a
table value of
<subqueryN>.
Before MySQL 5.6.7, temporary table use for
materialization is indicated in the
Extra column by
Materialize if a single table is
used, or by Start materialize and
End materialize if multiple tables
are used. If Scan is present, no
temporary table index is used for table reads.
Otherwise, an index lookup is used.
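One way to check for the indicators just listed, using the class and roster query from earlier in this section. No output is shown here because the plan depends on the server version, the flag settings, and the data.

```sql
-- Look at the Extra column for FirstMatch(...), LooseScan(...),
-- Start temporary / End temporary, and at select_type for MATERIALIZED.
EXPLAIN EXTENDED
SELECT class_num, class_name
FROM class
WHERE class_num IN (SELECT class_num FROM roster);

-- Shows the rewritten query. If the subquery was converted to a
-- semi-join, the IN predicate no longer appears in the statement.
SHOW WARNINGS;
```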
As of MySQL 5.6.5, the optimizer uses subquery materialization as a strategy that enables more efficient subquery processing. Materialization speeds up query execution by generating a subquery result as a temporary table, normally in memory. The first time MySQL needs the subquery result, it materializes that result into a temporary table. Any subsequent time the result is needed, MySQL refers again to the temporary table. The table is indexed with a hash index to make lookups fast and inexpensive. The index is unique, which makes the table smaller because it has no duplicates.
Subquery materialization attempts to use an in-memory temporary table when possible, falling back to on-disk storage if the table becomes too large. See Section 8.4.4, “Internal Temporary Table Use in MySQL”.
If materialization is not used, the optimizer sometimes
rewrites a noncorrelated subquery as a correlated subquery.
For example, the following IN subquery is
noncorrelated (where_condition
involves only columns from t2 and not
t1):
SELECT * FROM t1
WHERE t1.a IN (SELECT t2.b FROM t2 WHERE where_condition);
The optimizer might rewrite this as an
EXISTS correlated subquery:
SELECT * FROM t1
WHERE EXISTS (SELECT t2.b FROM t2 WHERE where_condition AND t1.a=t2.b);
Subquery materialization using a temporary table avoids such rewrites and makes it possible to execute the subquery only once rather than once per row of the outer query.
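The idea behind materialization can be imitated by hand, which may help in understanding what the optimizer does on your behalf. This is only an illustration of the concept and not a description of the server's internal implementation. The INT column types, the name tmp_subquery, and the condition t2.c > 10 (standing in for where_condition) are assumptions.

```sql
-- Step 1: build the temporary table once. The unique hash index
-- removes duplicates and makes lookups inexpensive.
CREATE TEMPORARY TABLE tmp_subquery (
  b INT NOT NULL,
  PRIMARY KEY (b) USING HASH
) ENGINE = MEMORY;

-- NULL values are skipped: in a WHERE clause they can never make
-- the IN predicate true.
INSERT INTO tmp_subquery
SELECT DISTINCT t2.b
FROM t2
WHERE t2.c > 10 AND t2.b IS NOT NULL;

-- Step 2: probe the temporary table for each row of t1 instead of
-- re-executing the subquery against t2.
SELECT t1.*
FROM t1
WHERE EXISTS (SELECT 1 FROM tmp_subquery WHERE tmp_subquery.b = t1.a);

DROP TEMPORARY TABLE tmp_subquery;
```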
For subquery materialization to be used in MySQL, the
materialization flag of the
optimizer_switch system
variable must be on. Materialization then
applies to subquery predicates that appear anywhere (in the
select list, WHERE,
ON, GROUP BY,
HAVING, or ORDER BY),
for predicates that fall into any of these use cases:
The predicate has this form, when no outer expression
oe_i or inner expression
ie_i is nullable.
N can be 1 or larger.
(oe_1, oe_2, ..., oe_N) [NOT] IN (SELECT ie_1, ie_2, ..., ie_N ...)
The predicate has this form, when there is a single
outer expression oe and inner
expression ie. The
expressions can be nullable.
oe [NOT] IN (SELECT ie ...)
The predicate is IN or NOT
IN and a result of UNKNOWN
(NULL) has the same meaning as a
result of FALSE.
The following examples illustrate how the requirement for
equivalence of UNKNOWN and
FALSE predicate evaluation affects
whether subquery materialization can be used. Assume that
where_condition involves columns
only from t2 and not
t1 so that the subquery is noncorrelated.
This query is subject to materialization:
SELECT * FROM t1
WHERE t1.a IN (SELECT t2.b FROM t2 WHERE where_condition);
Here, it does not matter whether the IN
predicate returns UNKNOWN or
FALSE. Either way, the row from
t1 is not included in the query result.
An example where subquery materialization will not be used
is the following query, where t2.b is a
nullable column.
SELECT * FROM t1
WHERE (t1.a,t1.b) NOT IN (SELECT t2.a,t2.b FROM t2
WHERE where_condition);
The following restrictions apply to the use of subquery materialization:
The types of the inner and outer expressions must match. For example, the optimizer might be able to use materialization if both expressions are integer or both are decimal. The optimizer cannot use materialization if one expression is integer and the other is decimal.
The inner expression cannot be a
BLOB.
Use of EXPLAIN with a query
can give some indication of whether the optimizer uses
subquery materialization. Compared to query execution that
does not use materialization, select_type
may change from DEPENDENT SUBQUERY to
SUBQUERY. This indicates that, for a
subquery that would be executed once per outer row,
materialization enables the subquery to be executed just
once. In addition, for extended
EXPLAIN output, the text
displayed by a following SHOW
WARNINGS includes materialize and
materialized-subquery
(materialized subselect before MySQL
5.6.6).
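A sketch of how the two plans might be compared for the t1 and t2 query shown earlier. Semi-join is switched off here so that only subquery materialization is in play; that choice is an assumption made to keep the comparison simple. Because the decision can be cost-based, the plans may not differ for every data set.

```sql
-- Without materialization: select_type for the subquery is
-- expected to be DEPENDENT SUBQUERY.
SET SESSION optimizer_switch = 'semijoin=off,materialization=off';
EXPLAIN SELECT * FROM t1 WHERE t1.a IN (SELECT t2.b FROM t2);

-- With materialization permitted: select_type may change to SUBQUERY.
SET SESSION optimizer_switch = 'semijoin=off,materialization=on';
EXPLAIN EXTENDED SELECT * FROM t1 WHERE t1.a IN (SELECT t2.b FROM t2);
SHOW WARNINGS;

-- Restore the defaults for both flags.
SET SESSION optimizer_switch = 'semijoin=default,materialization=default';
```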
As of MySQL 5.6.3, the optimizer more efficiently handles
derived tables (subqueries in the FROM
clause):
The optimizer postpones materialization of subqueries in
the FROM clause until their contents
are needed during query execution, which improves
performance:
Previously, subqueries in the
FROM clause were materialized for
EXPLAIN
SELECT statements. This resulted in
partial SELECT
execution, even though the purpose of
EXPLAIN is to obtain
query plan information, not to execute the query.
This materialization no longer occurs, so
EXPLAIN is faster for
such queries.
For non-EXPLAIN
queries, delay of materialization may result in not
having to do it at all. Consider a query that joins
the result of a subquery in the
FROM clause to another table: If
the optimizer processes that other table first and
finds that it returns no rows, the join need not be
carried out further and the optimizer can completely
skip materializing the subquery.
During query execution, the optimizer may add an index to a derived table to speed up row retrieval from it.
Consider the following
EXPLAIN statement, for which
a subquery appears in the FROM clause of
a SELECT query:
EXPLAIN SELECT * FROM (SELECT * FROM t1) AS derived_t1;
The optimizer avoids materializing the subquery by delaying
it until the result is needed during
SELECT execution. In this
case, the query is not executed, so the result is never
needed.
Even for queries that are executed, delay of subquery
materialization may enable the optimizer to avoid
materialization entirely. Consider the following query,
which joins the result of a subquery in the
FROM clause to another table:
SELECT *
FROM t1 JOIN (SELECT t2.f1 FROM t2) AS derived_t2
ON t1.f2=derived_t2.f1
WHERE t1.f1 > 0;
If the optimizer processes t1 first
and the WHERE clause produces an empty
result, the join must necessarily be empty and the subquery
need not be materialized.
In the worst case (derived tables are materialized), query execution takes the same time as before MySQL 5.6.3 because no additional work is done. In the best case (derived tables are not materialized), query execution is quicker by the time needed to perform materialization.
For cases when a derived table requires materialization, the
optimizer may speed up access to the result by adding an
index to the materialized table. If such an index enables
ref access to the table,
it can greatly reduce the amount of data that must be read
during query execution. Consider the following query:
SELECT *
FROM t1 JOIN (SELECT DISTINCT f1 FROM t2) AS derived_t2
ON t1.f1=derived_t2.f1;
The optimizer constructs an index over column
f1 from derived_t2 if
doing so would enable use of
ref access for the lowest
cost execution plan. After adding the index, the optimizer
can treat the materialized derived table the same as a usual
table with an index, and it benefits similarly from the
generated index. The overhead of index creation is
negligible compared to the cost of query execution without
the index. If ref access
would result in higher cost than some other access method,
the optimizer creates no index and loses nothing.
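To see whether an index was added to the materialized derived table, EXPLAIN can be run on the same query. The names mentioned in the comments (a derived-table placeholder such as <derived2> and a generated key name such as <auto_key0>) are what such plans typically show; treat them as an expectation to verify on your own server rather than guaranteed output.

```sql
-- If the optimizer added an index, the row for the derived table
-- (shown with a placeholder name such as <derived2>) should have
-- type ref and a generated key name rather than NULL in the key column.
EXPLAIN
SELECT *
FROM t1 JOIN (SELECT DISTINCT f1 FROM t2) AS derived_t2
     ON t1.f1 = derived_t2.f1;
```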
Certain optimizations are applicable to comparisons that use
the IN operator to test subquery results
(or that use =ANY, which is equivalent).
This section discusses these optimizations, particularly
with regard to the challenges that NULL
values present. The last part of the discussion includes
suggestions on what you can do to help the optimizer.
Consider the following subquery comparison:
outer_expr IN (SELECT inner_expr FROM ... WHERE subquery_where)
MySQL evaluates queries “from outside to
inside.” That is, it first obtains the value of the
outer expression outer_expr, and
then runs the subquery and captures the rows that it
produces.
A very useful optimization is to “inform” the
subquery that the only rows of interest are those where the
inner expression inner_expr is
equal to outer_expr. This is done
by pushing down an appropriate equality into the subquery's
WHERE clause. That is, the comparison is
converted to this:
EXISTS (SELECT 1 FROM ... WHERE subquery_where AND outer_expr=inner_expr)
After the conversion, MySQL can use the pushed-down equality to limit the number of rows that it must examine when evaluating the subquery.
More generally, a comparison of N
values to a subquery that returns
N-value rows is subject to the
same conversion. If oe_i and
ie_i represent corresponding
outer and inner expression values, this subquery comparison:
(oe_1, ..., oe_N) IN (SELECT ie_1, ..., ie_N FROM ... WHERE subquery_where)
Becomes:
EXISTS (SELECT 1 FROM ... WHERE subquery_where AND oe_1 = ie_1 AND ... AND oe_N = ie_N)
For simplicity, the following discussion assumes a single pair of outer and inner expression values.
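As a concrete instance of the N-value conversion with N = 2, consider the following pair of statements. The columns a, b, and c and the condition t2.c > 10 are hypothetical, and the two forms are interchangeable only when none of the compared columns can be NULL.

```sql
-- Original comparison of a two-value row to a two-column subquery.
SELECT * FROM t1
WHERE (t1.a, t1.b) IN (SELECT t2.a, t2.b FROM t2 WHERE t2.c > 10);

-- Form with both equalities pushed down into the subquery.
SELECT * FROM t1
WHERE EXISTS (SELECT 1 FROM t2
              WHERE t2.c > 10
                AND t1.a = t2.a
                AND t1.b = t2.b);
```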
The conversion just described has its limitations. It is
valid only if we ignore possible NULL
values. That is, the “pushdown” strategy works
as long as both of these two conditions are true:
outer_expr and
inner_expr cannot be
NULL.
You do not need to distinguish NULL
from FALSE subquery results. If the
subquery is a part of an OR
or AND expression in the
WHERE clause, MySQL assumes that you
do not care. Another instance where the optimizer
notices that NULL and
FALSE subquery results need not be
distinguished is this construct:
... WHERE outer_expr IN (subquery)
In this case, the WHERE clause
rejects the row whether
IN (subquery) returns
NULL or FALSE.
When either or both of those conditions do not hold, optimization is more complex.
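Suppose that outer_expr is known
to be a non-NULL value but the subquery
does not produce a row such that
outer_expr =
inner_expr. Then
outer_expr IN (SELECT ...)
evaluates as follows:
NULL, if the SELECT produces any row where inner_expr is NULL
FALSE, if the SELECT produces only non-NULL values or produces nothing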
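These rules for a non-NULL outer value with no matching inner value can be checked directly with literal values. The statements below use constant subqueries for illustration only.

```sql
-- No match, but the subquery produces a NULL: the result is NULL.
SELECT 1 IN (SELECT 2 UNION ALL SELECT NULL);   -- NULL

-- No match and only non-NULL values: the result is FALSE.
SELECT 1 IN (SELECT 2 UNION ALL SELECT 3);      -- 0

-- No match because the subquery produces no rows: the result is FALSE.
SELECT 1 IN (SELECT 2 FROM DUAL WHERE 1 = 0);   -- 0
```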
In this situation, the approach of looking for rows with
outer_expr = inner_expr is no longer
valid. It is necessary to look for such rows, but if none
are found, also look for rows where
inner_expr is
NULL. Roughly speaking, the subquery can
be converted to something like this:
EXISTS (SELECT 1 FROM ... WHERE subquery_where AND (outer_expr=inner_expr OR inner_expr IS NULL))
The need to evaluate the extra IS
NULL condition is why MySQL has the
ref_or_null access
method:
mysql> EXPLAIN
    -> SELECT outer_expr IN (SELECT t2.maybe_null_key
    ->                       FROM t2, t3 WHERE ...)
    -> FROM t1;
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: t1
...
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: t2
         type: ref_or_null
possible_keys: maybe_null_key
          key: maybe_null_key
      key_len: 5
          ref: func
         rows: 2
        Extra: Using where; Using index
...
The unique_subquery and
index_subquery
subquery-specific access methods also have “or
NULL” variants. However, they are
not visible in EXPLAIN
output, so you must use EXPLAIN
EXTENDED followed by SHOW
WARNINGS (note the checking
NULL in the warning message):
mysql> EXPLAIN EXTENDED
    -> SELECT outer_expr IN (SELECT maybe_null_key FROM t2) FROM t1\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: t1
...
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: t2
         type: index_subquery
possible_keys: maybe_null_key
          key: maybe_null_key
      key_len: 5
          ref: func
         rows: 2
        Extra: Using index

mysql> SHOW WARNINGS\G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: select <in_optimizer>(`test`.`t1`.`outer_expr`,
         <exists>(<index_lookup>(<cache>(`test`.`t1`.`outer_expr`)
         in t2 on maybe_null_key checking NULL))) AS `outer_expr IN
         (SELECT maybe_null_key FROM t2)` from `test`.`t1`
The additional OR ... IS NULL condition
makes query execution slightly more complicated (and some
optimizations within the subquery become inapplicable), but
generally this is tolerable.
The situation is much worse when
outer_expr can be
NULL. According to the SQL interpretation
of NULL as “unknown value,”
NULL IN (SELECT inner_expr ...)
should evaluate to:
NULL, if the SELECT produces any rows
FALSE, if the SELECT produces no rows
For proper evaluation, it is necessary to be able to check
whether the SELECT has
produced any rows at all, so
outer_expr = inner_expr
cannot be
pushed down into the subquery. This is a problem, because
many real-world subqueries become very slow unless the
equality can be pushed down.
Essentially, there must be different ways to execute the
subquery depending on the value of
outer_expr.
The optimizer chooses SQL compliance over speed, so it
accounts for the possibility that
outer_expr might be
NULL.
If outer_expr is
NULL, to evaluate the following
expression, it is necessary to run the
SELECT to determine whether
it produces any rows:
NULL IN (SELECT inner_expr FROM ... WHERE subquery_where)
It is necessary to run the original
SELECT here, without any
pushed-down equalities of the kind mentioned earlier.
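The dependence on whether the subquery produces any rows can be observed with constant subqueries, used here purely for illustration:

```sql
-- The subquery produces a row, so the result is unknown.
SELECT NULL IN (SELECT 1);                        -- NULL

-- The subquery produces no rows, so the result is FALSE.
SELECT NULL IN (SELECT 1 FROM DUAL WHERE 1 = 0);  -- 0
```

An equality such as NULL = inner_expr pushed into the subquery would filter out every row and make both cases look the same, which is why the unmodified SELECT has to be run.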
On the other hand, when
outer_expr is not
NULL, it is absolutely essential that
this comparison:
outer_expr IN (SELECT inner_expr FROM ... WHERE subquery_where)
be converted to this expression that uses a pushed-down condition:
EXISTS (SELECT 1 FROM ... WHERE subquery_where AND outer_expr=inner_expr)
Without this conversion, subqueries will be slow. To solve the dilemma of whether to push down or not push down conditions into the subquery, the conditions are wrapped in “trigger” functions. Thus, an expression of the following form:
outer_expr IN (SELECT inner_expr FROM ... WHERE subquery_where)
is converted into:
EXISTS (SELECT 1 FROM ... WHERE subquery_where AND trigcond(outer_expr=inner_expr))
More generally, if the subquery comparison is based on several pairs of outer and inner expressions, the conversion takes this comparison:
(oe_1, ..., oe_N) IN (SELECT ie_1, ..., ie_N FROM ... WHERE subquery_where)
and converts it to this expression:
EXISTS (SELECT 1 FROM ... WHERE subquery_where AND trigcond(oe_1=ie_1) AND ... AND trigcond(oe_N=ie_N))
Each
trigcond(X) is
a special function that evaluates to the following values:
X when the
“linked” outer expression
oe_i is not
NULL
TRUE when the “linked”
outer expression oe_i is
NULL
Trigger functions are not triggers of
the kind that you create with CREATE
TRIGGER.
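The behavior of a trigger function can be modeled in ordinary SQL, although trigcond() itself is internal to the optimizer and cannot be called from a statement. The sketch below uses the user variables @oe and @ie as hypothetical stand-ins for the linked outer expression and the inner expression, with IF() playing the role of trigcond(@oe = @ie).

```sql
-- Outer value is not NULL: the model evaluates the wrapped equality.
SET @oe = 5, @ie = 5;
SELECT IF(@oe IS NULL, TRUE, @oe = @ie) AS trigcond_model;  -- 1

SET @oe = 5, @ie = 7;
SELECT IF(@oe IS NULL, TRUE, @oe = @ie) AS trigcond_model;  -- 0

-- Outer value is NULL: the condition is switched off and is always TRUE,
-- so it no longer filters any rows from the subquery.
SET @oe = NULL, @ie = 7;
SELECT IF(@oe IS NULL, TRUE, @oe = @ie) AS trigcond_model;  -- 1
```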
Equalities that are wrapped into
trigcond() functions are not first class
predicates for the query optimizer. Most optimizations
cannot deal with predicates that may be turned on and off at
query execution time, so they assume any
trigcond(X) to
be an unknown function and ignore it. At the moment,
triggered equalities can be used by these optimizations:
Reference optimizations:
trigcond(X=Y [OR Y IS NULL]) can
be used to construct
ref,
eq_ref, or
ref_or_null table
accesses.
Index lookup-based subquery execution engines:
trigcond(X=Y)
can be used to construct
unique_subquery or
index_subquery
accesses.
Table-condition generator: If the subquery is a join of several tables, the triggered condition will be checked as soon as possible.
When the optimizer uses a triggered condition to create some
kind of index lookup-based access (as for the first two
items of the preceding list), it must have a fallback
strategy for the case when the condition is turned off. This
fallback strategy is always the same: Do a full table scan.
In EXPLAIN output, the
fallback shows up as Full scan on NULL
key in the Extra column:
mysql> EXPLAIN SELECT t1.col1,
    -> t1.col1 IN (SELECT t2.key1 FROM t2 WHERE t2.col2=t1.col2) FROM t1\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: t1
...
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: t2
         type: index_subquery
possible_keys: key1
          key: key1
      key_len: 5
          ref: func
         rows: 2
        Extra: Using where; Full scan on NULL key
If you run EXPLAIN EXTENDED
followed by SHOW WARNINGS,
you can see the triggered condition:
*************************** 1. row ***************************
Level: Note
Code: 1003
Message: select `test`.`t1`.`col1` AS `col1`,
<in_optimizer>(`test`.`t1`.`col1`,
<exists>(<index_lookup>(<cache>(`test`.`t1`.`col1`) in t2
on key1 checking NULL
where (`test`.`t2`.`col2` = `test`.`t1`.`col2`) having
trigcond(<is_not_null_test>(`test`.`t2`.`key1`))))) AS
`t1.col1 IN (select t2.key1 from t2 where t2.col2=t1.col2)`
from `test`.`t1`
The use of triggered conditions has some performance
implications. A NULL IN (SELECT ...)
expression now may cause a full table scan (which is slow)
when it previously did not. This is the price paid for
correct results (the goal of the trigger-condition strategy
was to improve compliance and not speed).
For multiple-table subqueries, execution of NULL IN
(SELECT ...) will be particularly slow because the
join optimizer does not optimize for the case where the
outer expression is NULL. It assumes that
subquery evaluations with NULL on the
left side are very rare, even if there are statistics that
indicate otherwise. On the other hand, if the outer
expression might be NULL but never
actually is, there is no performance penalty.
To help the query optimizer better execute your queries, use these tips:
Declare a column as NOT NULL if it
really is. (This also helps other aspects of the
optimizer by simplifying condition testing for the
column.)
If you do not need to distinguish a
NULL from FALSE
subquery result, you can easily avoid the slow execution
path. Replace a comparison that looks like this:
outer_expr IN (SELECT inner_expr FROM ...)
with this expression:
(outer_expr IS NOT NULL) AND (outer_expr IN (SELECT inner_expr FROM ...))
Then NULL IN (SELECT ...) will never
be evaluated because MySQL stops evaluating
AND parts as soon as the
expression result is clear.
Another possible rewrite:
EXISTS (SELECT inner_expr FROM ... WHERE inner_expr=outer_expr)
This would apply when you need not distinguish
NULL from FALSE
subquery results, in which case you may actually want
EXISTS.
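Applied to the t1 and t2 tables from the earlier EXPLAIN example, the tips might look like this. The INT column type is an assumption, as is the premise that treating a NULL result as FALSE is acceptable for the application.

```sql
-- Tip 1: declare the column NOT NULL if it never holds NULL.
-- (The statement fails or alters data if NULL values are present,
-- depending on the SQL mode, so check the data first.)
ALTER TABLE t1 MODIFY col1 INT NOT NULL;

-- Tip 2: guard the comparison so that NULL IN (SELECT ...) is
-- never evaluated. A NULL col1 now yields 0 instead of NULL.
SELECT t1.col1,
       (t1.col1 IS NOT NULL
        AND t1.col1 IN (SELECT t2.key1 FROM t2 WHERE t2.col2 = t1.col2))
         AS found
FROM t1;

-- Tip 3: use EXISTS when NULL and FALSE need not be distinguished.
SELECT t1.col1,
       EXISTS (SELECT 1 FROM t2
               WHERE t2.col2 = t1.col2 AND t2.key1 = t1.col1) AS found
FROM t1;
```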
The subquery_materialization_cost_based
flag enables control over the choice between subquery
materialization and
IN-to-EXISTS subquery
transformation. See
Section 8.9.2, “Controlling Switchable Optimizations”.
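A sketch of how this flag might be used for a session. The description of the off setting in the comments reflects the optimizer_switch documentation as understood here; confirm it against the documentation for your server version.

```sql
-- Let the optimizer choose between materialization and the
-- IN-to-EXISTS transformation based on estimated cost.
SET SESSION optimizer_switch = 'subquery_materialization_cost_based=on';

-- Turn the cost-based choice off. With the materialization flag on,
-- the optimizer is then expected to prefer materialization.
SET SESSION optimizer_switch = 'subquery_materialization_cost_based=off';

-- Restore the default setting.
SET SESSION optimizer_switch = 'subquery_materialization_cost_based=default';
```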