+
+
Single-Column Statistics
As we saw in the previous section, the query planner needs to estimate
the number of rows retrieved by a query in order to make good choices
of query plans.
Further details about the planner's use of statistics can be found in
the chapter on how the planner uses statistics.
+
+
+
+
Extended Statistics
+
+
+
+
+
+
+
+
+
+ It is common to see slow queries with bad execution plans because
+ multiple columns used in the query clauses are correlated.
+ The planner normally assumes that multiple conditions
+ are independent of each other,
+ an assumption that does not hold when column values are correlated.
+ Regular statistics, because of their per-individual-column nature,
+ cannot capture any knowledge about cross-column correlation;
+ multivariate statistics can be used to instruct
+ the server to obtain statistics across such a set of columns,
+ which are later used by the query optimizer
+ to determine the cardinality and selectivity
+ of clauses involving those columns.
+ Multivariate statistics are currently the only use of
+ extended statistics.
+
+
+ Extended statistics are created using the
+ CREATE STATISTICS command;
+ see its reference page for more details.
+ Data collection is deferred until the next ANALYZE
+ on the table, after which the stored values can be examined in the
+ pg_statistic_ext
+ catalog.
+
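+ As a quick sketch of the workflow (the statistics object, table, and
+ column names here are placeholders; the concrete examples in the
+ subsections below use real data):
+
+CREATE STATISTICS stats_name WITH (dependencies, ndistinct)
+    ON (column1, column2) FROM some_table;
+ANALYZE some_table;
+SELECT stxname, stxkeys FROM pg_statistic_ext WHERE stxname = 'stats_name';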
+
+ The following subsections describe the types of extended statistics
+ that are currently supported.
+
+
+
+
Functional Dependencies
+
+ The simplest type of extended statistics tracks functional dependencies,
+ a concept used in definitions of database normal forms.
+ Put simply, it is said that column b is functionally
+ dependent on column a if knowledge of the value of
+ a is sufficient to determine the value of b.
+ In normalized databases, functional dependencies are allowed only on
+ primary keys and superkeys. However, many data sets are in practice not
+ fully normalized for various reasons; intentional denormalization for
+ performance reasons is a common example.
+
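+ As a concrete illustration (a minimal sketch; the column types and
+ sample rows are assumptions made for this example, while the column
+ names match the queries used below), consider a table of ZIP codes in
+ which each ZIP code belongs to exactly one city, so city
+ is functionally dependent on zip, but not vice versa:
+
+CREATE TABLE zipcodes (zip text, city text, state text);
+INSERT INTO zipcodes VALUES
+    ('10001', 'New York', 'NY'),      -- two ZIP codes ...
+    ('10002', 'New York', 'NY'),      -- ... mapping to the same city
+    ('94105', 'San Francisco', 'CA');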
+
+ The existence of functional dependencies directly affects the accuracy
+ of estimates in certain queries.
+ The reason is that conditions on the dependent columns do not further
+ restrict the result set, but the query planner (lacking functional
+ dependency knowledge) considers them independent, resulting in
+ underestimates.
+ To inform the planner about the functional dependencies, we collect
+ measurements of dependency during ANALYZE. Assessing
+ the degree of dependency between all sets of columns would be
+ prohibitively expensive, so the search is limited to potential
+ dependencies defined using the dependencies option of
+ extended statistics. It is advisable to create
+ dependencies statistics if and only if functional
+ dependencies actually exist, to avoid unnecessary overhead on both
+ ANALYZE and query planning.
+
+
+ To inspect functional dependencies on a statistics object
+ stts, you may do this:
+CREATE STATISTICS stts WITH (dependencies)
+ ON (zip, city) FROM zipcodes;
+ANALYZE zipcodes;
+SELECT stxname, stxkeys, stxdependencies
+ FROM pg_statistic_ext
+ WHERE stxname = 'stts';
+ stxname | stxkeys | stxdependencies
+---------+---------+--------------------------------------------
+ stts | 1 5 | [{1 => 5 : 1.000000}, {5 => 1 : 0.423130}]
+(1 row)
+
+ where it can be seen that column 1 (a zip code) fully determines column
+ 5 (city) so the coefficient is 1.0, while city only determines zip code
+ about 42% of the time, meaning that there are many cities (58%) that are
+ represented by more than a single ZIP code.
+
+
+ When computing the selectivity, the planner inspects all conditions and
+ attempts to identify which conditions are already implied by other
+ conditions. The selectivity estimates of any redundant conditions are
+ then ignored. For a query with equality conditions on both
+ zip and city, for example, the estimate for one of
+ the conditions can be discarded, thus improving the overall estimate.
+
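+ To observe the effect, one can compare the row estimates reported by
+ EXPLAIN for such a query with and without the statistics object (a
+ sketch only; the literal ZIP code and city are hypothetical, and the
+ estimated row counts depend on the actual data):
+
+-- with stts in place, the redundant condition on city is ignored,
+-- so the estimate stays close to that of the zip condition alone
+EXPLAIN SELECT * FROM zipcodes WHERE zip = '10001' AND city = 'New York';
+
+-- without the statistics object, the planner multiplies the two
+-- selectivities and tends to underestimate the number of matching rows
+DROP STATISTICS stts;
+EXPLAIN SELECT * FROM zipcodes WHERE zip = '10001' AND city = 'New York';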
+
+
+
Limitations of Functional Dependencies
+
+ Functional dependencies are a very simple type of statistics, and
+ as such have several limitations. The first limitation is that they
+ only work with simple equality conditions that compare columns to
+ constant values. They cannot be used to improve estimates for equality
+ conditions comparing two columns or a column to an expression, nor for
+ range clauses, LIKE, or any other type of condition.
+
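+ For example, even with functional-dependency statistics in place,
+ clauses like the following are not helped by them, and the planner
+ falls back to treating the conditions as independent (a sketch using
+ the same table t as in the example below; the
+ resulting estimates are not shown here):
+
+EXPLAIN SELECT * FROM t WHERE a = 1 AND b < 10;   -- range clause
+EXPLAIN SELECT * FROM t WHERE a = 1 AND b = a;    -- comparison of two columns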
+
+ When eliminating the implied conditions, the planner assumes that the
+ conditions are compatible. Consider the following example, where
+ this assumption does not hold:
+
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 10;
+ QUERY PLAN
+-----------------------------------------------------------------------------
+ Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=0 loops=1)
+ Filter: ((a = 1) AND (b = 10))
+ Rows Removed by Filter: 10000
+
+
+ While there are no rows with such a combination of values, the planner
+ is unable to verify whether the values match; it only knows that
+ the columns are functionally dependent.
+
+
+ Whether this assumption holds depends on the queries executed on the
+ database; in many cases it's actually satisfied (e.g. when a GUI only
+ allows selecting compatible values). But if that's not the case,
+ functional dependencies may not be a viable option.
+
+
+
+
+
+
Multivariate N-Distinct Coefficients
+
+ Single-column statistics store the number of distinct values in each
+ column. Estimates of the number of distinct values of more than one
+ column (for example, for GROUP BY a, b) are
+ frequently wrong when the planner only has single-column statistical
+ data, causing it to select bad plans.
+ In order to improve n-distinct estimation when multiple columns are
+ grouped together, the ndistinct option of extended statistics
+ can be used, which instructs ANALYZE to collect n-distinct
+ estimates for all possible combinations of two or more columns of the set
+ of columns in the statistics object (the per-column estimates are already
+ available in pg_statistic).
+
+
+ Continuing the above example, the n-distinct coefficients in a ZIP
+ code table may look like the following:
+CREATE STATISTICS stts2 WITH (ndistinct)
+ ON (zip, state, city) FROM zipcodes;
+ANALYZE zipcodes;
+SELECT stxkeys AS k, stxndistinct AS nd
+ FROM pg_statistic_ext
+ WHERE stxname = 'stts2';
+-[ RECORD 1 ]---------------------------------------------
+k | 1 2 5
+nd | [{(b 1 2), 33178.000000}, {(b 1 5), 33178.000000},
+ {(b 2 5), 27435.000000}, {(b 1 2 5), 33178.000000}]
+
+ which indicates that there are three combinations of columns that
+ have 33178 distinct values: ZIP code and state; ZIP code and city;
+ and ZIP code, city and state (the fact that they are all equal is
+ expected given the nature of ZIP-code data). On the other hand,
+ the combination of city and state only has 27435 distinct values.
+
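+ For comparison, the per-column n-distinct estimates that these
+ multivariate figures complement can be inspected through the
+ pg_stats view (a sketch; the column names follow
+ the ZIP-code example above):
+
+SELECT attname, n_distinct
+  FROM pg_stats
+ WHERE tablename = 'zipcodes' AND attname IN ('zip', 'state', 'city');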
+
+
operator-specific selectivity functions are mostly found
in src/backend/utils/adt/selfuncs.c .
+
- Functional Dependencies
+ Multivariate Statistics Examples
- The simplest type of extended statistics are functional dependencies,
- used in definitions of database normal forms. When simplified, saying that
- b is functionally dependent on a means that
- knowledge of value of a is sufficient to determine value of
- b.
-
+
+
+
+
Functional dependencies
- In normalized databases, only functional dependencies on primary keys
- and superkeys are allowed. However, in practice, many data sets are not
- fully normalized, for example, due to intentional denormalization for
- performance reasons.
-
+ Multivariate correlation can be seen with a very simple data set — a
+ table with two columns, both containing the same values:
- Functional dependencies directly affect accuracy of the estimates, as
- conditions on the dependent column(s) do not restrict the result set,
- resulting in underestimates.
+CREATE TABLE t (a INT, b INT);
+INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
+ANALYZE t;
+
+
+ As explained in the section on statistics used by the planner,
+ the planner can determine the
+ cardinality of t using the number of pages and
+ rows obtained from pg_class:
+
+SELECT relpages, reltuples FROM pg_class WHERE relname = 't';
+
+ relpages | reltuples
+----------+-----------
+ 45 | 10000
+
+
+ The data distribution is very simple; there are only 100 distinct values
+ in each column, uniformly distributed.
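+ This can be verified directly; both counts are 100, since
+ i % 100 only takes the values 0 through 99:
+
+SELECT count(DISTINCT a) AS distinct_a,
+       count(DISTINCT b) AS distinct_b
+  FROM t;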
- To inform the planner about the functional dependencies, we collect
- measurements of dependency during ANALYZE. Assessing
- dependency between all sets of columns would be prohibitively
- expensive, so we limit our search to potential dependencies defined
- using the CREATE STATISTICS command.
+ The following example shows the result of estimating a WHERE
+ condition on the a column:
-CREATE TABLE t (a INT, b INT);
-INSERT INTO t SELECT i/100, i/100 FROM generate_series(1,10000) s(i);
-CREATE STATISTICS s1 WITH (dependencies) ON (a, b) FROM t;
-ANALYZE t;
-EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
- QUERY PLAN
--------------------------------------------------------------------------------------------------
- Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual time=0.095..3.118 rows=100 loops=1)
- Filter: ((a = 1) AND (b = 1))
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1;
+ QUERY PLAN
+-------------------------------------------------------------------------------
+ Seq Scan on t (cost=0.00..170.00 rows=100 width=8) (actual rows=100 loops=1)
+ Filter: (a = 1)
Rows Removed by Filter: 9900
- Planning time: 0.367 ms
- Execution time: 3.380 ms
-(5 rows)
- The planner is now aware of the functional dependencies and considers
- them when computing the selectivity of the second condition. Running
- the query without the statistics would lead to quite different estimates.
+ The planner examines the condition and determines the selectivity
+ of this clause to be 1%. By comparing this estimate and the actual
+ number of rows, we see that the estimate is very accurate
+ (in fact exact, as the table is very small). If the
+ WHERE clause uses the b column instead, an identical
+ plan is generated. Observe what happens if we apply the same
+ condition on both columns, combining them with AND:
-DROP STATISTICS s1;
-EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
- QUERY PLAN
------------------------------------------------------------------------------------------------
- Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual time=0.000..6.379 rows=100 loops=1)
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
+ QUERY PLAN
+-----------------------------------------------------------------------------
+ Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=100 loops=1)
Filter: ((a = 1) AND (b = 1))
Rows Removed by Filter: 9900
- Planning time: 0.000 ms
- Execution time: 6.379 ms
-(5 rows)
-
- If no dependency exists, the collected statistics do not influence the
- query plan. The only effect is to slow down ANALYZE>. Should
- partial dependencies exist these will also be stored and applied
- during planning.
+ The planner estimates the selectivity for each condition individually,
+ arriving at the same 1% estimate as above, and then multiplies the two,
+ getting a final estimate of 0.01% (that is, 10,000 × 0.01 × 0.01 = 1 row).
+ The actual figures, however,
+ show that this results in a significant underestimate, as the actual
+ number of rows matching the conditions (100) is two orders of magnitude
+ higher than the estimated value.
- Similarly to per-column statistics, extended statistics are stored in
- a system catalog called pg_statistic_ext .
- To inspect the statistics s1 defined above,
- you may do this:
+ This problem can be fixed by applying functional-dependency
+ multivariate statistics on the two columns:
-SELECT stxname,stxdependencies FROM pg_statistic_ext WHERE stxname = 's1';
- stxname | stxdependencies
----------+--------------------------------------------
- s1 | [{1 => 2 : 1.000000}, {2 => 1 : 1.000000}]
-(1 row)
+CREATE STATISTICS stts WITH (dependencies) ON (a, b) FROM t;
+ANALYZE t;
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
+ QUERY PLAN
+-------------------------------------------------------------------------------
+ Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=100 loops=1)
+ Filter: ((a = 1) AND (b = 1))
+ Rows Removed by Filter: 9900
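+
+ The dependency degrees collected for t can be
+ examined in the pg_statistic_ext catalog; since
+ the two columns contain identical values, both directions would be
+ expected to show a degree of 1.0:
+
+SELECT stxname, stxdependencies
+  FROM pg_statistic_ext
+ WHERE stxname = 'stts';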
-
- This shows that the statistics are defined on table t,
- attnums lists attribute numbers of columns
- (references pg_attribute). It also shows
- the length in bytes of the functional dependencies, as found by
- ANALYZE when serialized into a bytea column.
+
+
+
Multivariate N-Distinct coefficients
- When computing the selectivity, the planner inspects all conditions and
- attempts to identify which conditions are already implied by other
- conditions. The selectivity estimates from any redundant conditions are
- ignored from a selectivity point of view. In the example query above,
- the selectivity estimates for either of the conditions may be eliminated,
- thus improving the overall estimate.
+ A similar problem occurs with estimation of the number of distinct
+ values, which is used to determine the number of groups that would be
+ generated by a GROUP BY clause. When GROUP BY
+ lists a single column, the n-distinct estimate (visible as the estimated
+ number of rows returned by the aggregate execution node) is very accurate:
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------
+ HashAggregate (cost=195.00..196.00 rows=100 width=12) (actual rows=100 loops=1)
+ Group Key: a
+ -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=4) (actual rows=10000 loops=1)
+
+ But without multivariate statistics, the estimate for the number of
+ groups in a query with two columns in GROUP BY , as
+ in the following example, is off by an order of magnitude:
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
+ QUERY PLAN
+--------------------------------------------------------------------------------------------
+ HashAggregate (cost=220.00..230.00 rows=1000 width=16) (actual rows=100 loops=1)
+ Group Key: a, b
+ -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)
+
+ By dropping the existing statistics object and re-creating it to include
+ n-distinct calculation, the estimate is much improved:
+DROP STATISTICS stts;
+CREATE STATISTICS stts WITH (dependencies, ndistinct) ON (a, b) FROM t;
+ANALYZE t;
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
+ QUERY PLAN
+--------------------------------------------------------------------------------------------
+ HashAggregate (cost=220.00..221.00 rows=100 width=16) (actual rows=100 loops=1)
+ Group Key: a, b
+ -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)
+
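+ Similarly, the collected n-distinct values can be examined in
+ pg_statistic_ext; for this data set the
+ combination of a and b would be expected
+ to show 100 distinct values, matching the actual number of groups above:
+
+SELECT stxname, stxndistinct
+  FROM pg_statistic_ext
+ WHERE stxname = 'stts';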
-
-
Limitations of functional dependencies
-
- Functional dependencies are a very simple type of statistics, and
- as such have several limitations. The first limitation is that they
- only work with simple equality conditions, comparing columns and constant
- values. It's not possible to use them to eliminate equality conditions
- comparing two columns or a column to an expression, range clauses,
- LIKE or any other type of conditions.
-
-
- When eliminating the implied conditions, the planner assumes that the
- conditions are compatible. Consider the following example, violating
- this assumption:
-
-EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 10;
- QUERY PLAN
------------------------------------------------------------------------------------------------
- Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual time=2.992..2.992 rows=0 loops=1)
- Filter: ((a = 1) AND (b = 10))
- Rows Removed by Filter: 10000
- Planning time: 0.232 ms
- Execution time: 3.033 ms
-(5 rows)
-
-
- While there are no rows with such combination of values, the planner
- is unable to verify whether the values match - it only knows that
- the columns are functionally dependent.
-
-
- This assumption is more about queries executed on the database - in many
- cases, it's actually satisfied (e.g. when the GUI only allows selecting
- compatible values). But if that's not the case, functional dependencies
- may not be a viable option.
-
-
- For additional information about functional dependencies, see
- src/backend/statistics/README.dependencies.
-
-
-
-
-
-