+
+
Single-Column Statistics
As we saw in the previous section, the query planner needs to estimate
the number of rows retrieved by a query in order to make good choices
of query plans.
Further details about the planner's use of statistics can be found in
the chapter on how the planner uses statistics.
+
+
+
+
Extended Statistics
+
+
+
+
+
+
+
+
+
+ It is common to see slow queries with bad execution plans because
+ multiple columns used in the query clauses are correlated.
+ The planner normally assumes that multiple conditions
+ are independent of each other,
+ an assumption that does not hold when column values are correlated.
+ Regular statistics, because of their per-individual-column nature,
+ cannot capture any knowledge about cross-column correlation;
+ multivariate statistics can be used to instruct
+ the server to obtain statistics across such a set of columns,
+ which are later used by the query optimizer
+ to determine the cardinality and selectivity
+ of clauses involving those columns.
+ Multivariate statistics are currently the only use of
+ extended statistics.
+
+
+ Extended statistics are created using the
+ CREATE STATISTICS command;
+ see its reference page for more details.
+ Data collection is deferred until the next ANALYZE
+ on the table, after which the stored values can be examined in the
+ pg_statistic_ext
+ catalog.
+
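+ As a quick sketch of the workflow (the statistics object, table, and
+ column names here are placeholders; the concrete examples in the
+ subsections below use real data):
+
+CREATE STATISTICS stats_name WITH (dependencies, ndistinct)
+    ON (column1, column2) FROM some_table;
+ANALYZE some_table;
+SELECT stxname, stxkeys FROM pg_statistic_ext WHERE stxname = 'stats_name';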
+
+ The following subsections describe the types of extended statistics
+ that are currently supported.
+
+
+
+
Functional Dependencies
+
+ The simplest type of extended statistics tracks functional dependencies,
+ a concept used in definitions of database normal forms.
+ Put simply, it is said that column b is functionally
+ dependent on column a if knowledge of the value of
+ a is sufficient to determine the value of b.
+ In normalized databases, functional dependencies are allowed only on
+ primary keys and superkeys. However, many data sets are in practice not
+ fully normalized for various reasons; intentional denormalization for
+ performance reasons is a common example.
+
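+ As a concrete illustration (a minimal sketch; the column types and
+ sample rows are assumptions made for this example, while the column
+ names match the queries used below), consider a table of ZIP codes in
+ which each ZIP code belongs to exactly one city, so city
+ is functionally dependent on zip, but not vice versa:
+
+CREATE TABLE zipcodes (zip text, city text, state text);
+INSERT INTO zipcodes VALUES
+    ('10001', 'New York', 'NY'),      -- two ZIP codes ...
+    ('10002', 'New York', 'NY'),      -- ... mapping to the same city
+    ('94105', 'San Francisco', 'CA');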
+
+ The existence of functional dependencies directly affects the accuracy
+ of estimates in certain queries.
+ The reason is that conditions on the dependent columns do not further
+ restrict the result set, but the query planner (lacking functional
+ dependency knowledge) considers them independent, resulting in
+ underestimates.
+ To inform the planner about the functional dependencies, we collect
+ measurements of dependency during ANALYZE. Assessing
+ the degree of dependency between all sets of columns would be
+ prohibitively expensive, so the search is limited to potential
+ dependencies defined using the dependencies option of
+ extended statistics. It is advisable to create
+ dependencies statistics if and only if functional
+ dependencies actually exist, to avoid unnecessary overhead on both
+ ANALYZE and query planning.
+
+
+ To inspect functional dependencies on a statistics object
+ stts, you may do this:
+CREATE STATISTICS stts WITH (dependencies)
+ ON (zip, city) FROM zipcodes;
+ANALYZE zipcodes;
+SELECT stxname, stxkeys, stxdependencies
+ FROM pg_statistic_ext
+ WHERE stxname = 'stts';
+ stxname | stxkeys | stxdependencies
+---------+---------+--------------------------------------------
+ stts | 1 5 | [{1 => 5 : 1.000000}, {5 => 1 : 0.423130}]
+(1 row)
+
+ where it can be seen that column 1 (a zip code) fully determines column
+ 5 (city) so the coefficient is 1.0, while city only determines zip code
+ about 42% of the time, meaning that there are many cities (58%) that are
+ represented by more than a single ZIP code.
+
+
+ When computing the selectivity, the planner inspects all conditions and
+ attempts to identify which conditions are already implied by other
+ conditions. The selectivity estimates of any redundant conditions are
+ then ignored. For a query with equality conditions on both
+ zip and city, for example, the estimate for one of
+ the conditions can be discarded, thus improving the overall estimate.
+
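+ To observe the effect, one can compare the row estimates reported by
+ EXPLAIN for such a query with and without the statistics object (a
+ sketch only; the literal ZIP code and city are hypothetical, and the
+ estimated row counts depend on the actual data):
+
+-- with stts in place, the redundant condition on city is ignored,
+-- so the estimate stays close to that of the zip condition alone
+EXPLAIN SELECT * FROM zipcodes WHERE zip = '10001' AND city = 'New York';
+
+-- without the statistics object, the planner multiplies the two
+-- selectivities and tends to underestimate the number of matching rows
+DROP STATISTICS stts;
+EXPLAIN SELECT * FROM zipcodes WHERE zip = '10001' AND city = 'New York';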
+
+
+
Limitations of Functional Dependencies
+
+ Functional dependencies are a very simple type of statistics, and
+ as such have several limitations. The first limitation is that they
+ only work with simple equality conditions that compare columns to
+ constant values. They cannot be used to improve estimates for equality
+ conditions comparing two columns or a column to an expression, nor for
+ range clauses, LIKE, or any other type of condition.
+
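+ For example, even with functional-dependency statistics in place,
+ clauses like the following are not helped by them, and the planner
+ falls back to treating the conditions as independent (a sketch using
+ the same table t as in the example below; the
+ resulting estimates are not shown here):
+
+EXPLAIN SELECT * FROM t WHERE a = 1 AND b < 10;   -- range clause
+EXPLAIN SELECT * FROM t WHERE a = 1 AND b = a;    -- comparison of two columns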
+
+ When eliminating the implied conditions, the planner assumes that the
+ conditions are compatible. Consider the following example, where
+ this assumption does not hold:
+
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 10;
+ QUERY PLAN
+-----------------------------------------------------------------------------
+ Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=0 loops=1)
+ Filter: ((a = 1) AND (b = 10))
+ Rows Removed by Filter: 10000
+
+
+ While there are no rows with such a combination of values, the planner
+ is unable to verify whether the values match; it only knows that
+ the columns are functionally dependent.
+
+
+ Whether this assumption holds depends on the queries executed on the
+ database; in many cases it's actually satisfied (e.g. when a GUI only
+ allows selecting compatible values). But if that's not the case,
+ functional dependencies may not be a viable option.
+
+
+
+
+
+
Multivariate N-Distinct Coefficients
+
+ Single-column statistics store the number of distinct values in each
+ column. Estimates of the number of distinct values of more than one
+ column (for example, for GROUP BY a, b) are
+ frequently wrong when the planner only has single-column statistical
+ data, causing it to select bad plans.
+ In order to improve n-distinct estimation when multiple columns are
+ grouped together, the ndistinct option of extended statistics
+ can be used, which instructs ANALYZE to collect n-distinct
+ estimates for all possible combinations of two or more columns of the set
+ of columns in the statistics object (the per-column estimates are already
+ available in pg_statistic).
+
+
+ Continuing the above example, the n-distinct coefficients in a ZIP
+ code table may look like the following:
+CREATE STATISTICS stts2 WITH (ndistinct)
+ ON (zip, state, city) FROM zipcodes;
+ANALYZE zipcodes;
+SELECT stxkeys AS k, stxndistinct AS nd
+ FROM pg_statistic_ext
+ WHERE stxname = 'stts2';
+-[ RECORD 1 ]---------------------------------------------
+k | 1 2 5
+nd | [{(b 1 2), 33178.000000}, {(b 1 5), 33178.000000},
+ {(b 2 5), 27435.000000}, {(b 1 2 5), 33178.000000}]
+
+ which indicates that there are three combinations of columns that
+ have 33178 distinct values: ZIP code and state; ZIP code and city;
+ and ZIP code, city and state (the fact that they are all equal is
+ expected given the nature of ZIP-code data). On the other hand,
+ the combination of city and state only has 27435 distinct values.
+
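+ For comparison, the per-column n-distinct estimates that these
+ multivariate figures complement can be inspected through the
+ pg_stats view (a sketch; the column names follow
+ the ZIP-code example above):
+
+SELECT attname, n_distinct
+  FROM pg_stats
+ WHERE tablename = 'zipcodes' AND attname IN ('zip', 'state', 'city');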
+
+
operator-specific selectivity functions are mostly found
in src/backend/utils/adt/selfuncs.c .
+
- Functional Dependencies
+ Multivariate Statistics Examples
- The simplest type of extended statistics are functional dependencies,
- used in definitions of database normal forms. When simplified, saying that
- b is functionally dependent on a means that
- knowledge of value of a is sufficient to determine value of
- b.
-
+
+
+
+
Functional dependencies
- In normalized databases, only functional dependencies on primary keys
- and superkeys are allowed. However, in practice, many data sets are not
- fully normalized, for example, due to intentional denormalization for
- performance reasons.
-
+ Multivariate correlation can be seen with a very simple data set — a
+ table with two columns, both containing the same values:
- Functional dependencies directly affect accuracy of the estimates, as
- conditions on the dependent column(s) do not restrict the result set,
- resulting in underestimates.
+CREATE TABLE t (a INT, b INT);
+INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
+ANALYZE t;
+
+
+ As explained in the section on statistics used by the planner,
+ the planner can determine the
+ cardinality of t using the number of pages and
+ rows obtained from pg_class:
+
+SELECT relpages, reltuples FROM pg_class WHERE relname = 't';
+
+ relpages | reltuples
+----------+-----------
+ 45 | 10000
+
+
+ The data distribution is very simple; there are only 100 distinct values
+ in each column, uniformly distributed.
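+ This can be verified directly; both counts are 100, since
+ i % 100 only takes the values 0 through 99:
+
+SELECT count(DISTINCT a) AS distinct_a,
+       count(DISTINCT b) AS distinct_b
+  FROM t;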
- To inform the planner about the functional dependencies, we collect
- measurements of dependency during ANALYZE. Assessing
- dependency between all sets of columns would be prohibitively
- expensive, so we limit our search to potential dependencies defined
- using the CREATE STATISTICS command.
+ The following example shows the result of estimating a WHERE
+ condition on the a column:
-CREATE TABLE t (a INT, b INT);
-INSERT INTO t SELECT i/100, i/100 FROM generate_series(1,10000) s(i);
-CREATE STATISTICS s1 WITH (dependencies) ON (a, b) FROM t;
-ANALYZE t;
-EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
- QUERY PLAN
--------------------------------------------------------------------------------------------------
- Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual time=0.095..3.118 rows=100 loops=1)
- Filter: ((a = 1) AND (b = 1))
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1;
+ QUERY PLAN
+-------------------------------------------------------------------------------
+ Seq Scan on t (cost=0.00..170.00 rows=100 width=8) (actual rows=100 loops=1)
+ Filter: (a = 1)
Rows Removed by Filter: 9900
- Planning time: 0.367 ms
- Execution time: 3.380 ms
-(5 rows)
- The planner is now aware of the functional dependencies and considers
- them when computing the selectivity of the second condition. Running
- the query without the statistics would lead to quite different estimates.
+ The planner examines the condition and determines the selectivity
+ of this clause to be 1%. By comparing this estimate and the actual
+ number of rows, we see that the estimate is very accurate
+ (in fact exact, as the table is very small). If the
+ WHERE clause uses the b column instead, an identical
+ plan is generated. Observe what happens if we apply the same
+ condition on both columns, combining them with AND:
-DROP STATISTICS s1;
-EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
- QUERY PLAN
------------------------------------------------------------------------------------------------
- Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual time=0.000..6.379 rows=100 loops=1)
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
+ QUERY PLAN
+-----------------------------------------------------------------------------
+ Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=100 loops=1)
Filter: ((a = 1) AND (b = 1))
Rows Removed by Filter: 9900
- Planning time: 0.000 ms
- Execution time: 6.379 ms
-(5 rows)
-
- If no dependency exists, the collected statistics do not influence the
- query plan. The only effect is to slow down ANALYZE>. Should
- partial dependencies exist these will also be stored and applied
- during planning.
+ The planner estimates the selectivity for each condition individually,
+ arriving at the same 1% estimate as above, and then multiplies the two,
+ getting a final estimate of 0.01% (that is, 10,000 × 0.01 × 0.01 = 1 row).
+ The actual figures, however,
+ show that this results in a significant underestimate, as the actual
+ number of rows matching the conditions (100) is two orders of magnitude
+ higher than the estimated value.
- Similarly to per-column statistics, extended statistics are stored in
- a system catalog called pg_statistic_ext .
- To inspect the statistics s1 defined above,
- you may do this:
+ This problem can be fixed by applying functional-dependency
+ multivariate statistics on the two columns:
-SELECT stxname,stxdependencies FROM pg_statistic_ext WHERE stxname = 's1';
- stxname | stxdependencies
----------+--------------------------------------------
- s1 | [{1 => 2 : 1.000000}, {2 => 1 : 1.000000}]
-(1 row)
+CREATE STATISTICS stts WITH (dependencies) ON (a, b) FROM t;
+ANALYZE t;
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
+ QUERY PLAN
+-------------------------------------------------------------------------------
+ Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=100 loops=1)
+ Filter: ((a = 1) AND (b = 1))
+ Rows Removed by Filter: 9900
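+
+ The dependency degrees collected for t can be
+ examined in the pg_statistic_ext catalog; since
+ the two columns contain identical values, both directions would be
+ expected to show a degree of 1.0:
+
+SELECT stxname, stxdependencies
+  FROM pg_statistic_ext
+ WHERE stxname = 'stts';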
-
- This shows that the statistics are defined on table t,
- attnums lists attribute numbers of columns
- (references pg_attribute). It also shows
- the length in bytes of the functional dependencies, as found by
- ANALYZE when serialized into a bytea column.
+
+
+
Multivariate N-Distinct coefficients
- When computing the selectivity, the planner inspects all conditions and
- attempts to identify which conditions are already implied by other
- conditions. The selectivity estimates from any redundant conditions are
- ignored from a selectivity point of view. In the example query above,
- the selectivity estimates for either of the conditions may be eliminated,
- thus improving the overall estimate.
+ A similar problem occurs with estimation of the number of distinct
+ values, which is used to determine the number of groups that would be
+ generated by a GROUP BY clause. When GROUP BY
+ lists a single column, the n-distinct estimate (visible as the estimated
+ number of rows returned by the aggregate execution node) is very accurate:
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------
+ HashAggregate (cost=195.00..196.00 rows=100 width=12) (actual rows=100 loops=1)
+ Group Key: a
+ -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=4) (actual rows=10000 loops=1)
+
+ But without multivariate statistics, the estimate for the number of
+ groups in a query with two columns in GROUP BY , as
+ in the following example, is off by an order of magnitude:
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
+ QUERY PLAN
+--------------------------------------------------------------------------------------------
+ HashAggregate (cost=220.00..230.00 rows=1000 width=16) (actual rows=100 loops=1)
+ Group Key: a, b
+ -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)
+
+ By dropping the existing statistics object and re-creating it to include
+ n-distinct calculation, the estimate is much improved:
+DROP STATISTICS stts;
+CREATE STATISTICS stts WITH (dependencies, ndistinct) ON (a, b) FROM t;
+ANALYZE t;
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
+ QUERY PLAN
+--------------------------------------------------------------------------------------------
+ HashAggregate (cost=220.00..221.00 rows=100 width=16) (actual rows=100 loops=1)
+ Group Key: a, b
+ -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)
+
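+ Similarly, the collected n-distinct values can be examined in
+ pg_statistic_ext; for this data set the
+ combination of a and b would be expected
+ to show 100 distinct values, matching the actual number of groups above:
+
+SELECT stxname, stxndistinct
+  FROM pg_statistic_ext
+ WHERE stxname = 'stts';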
-
-
Limitations of functional dependencies
-
- Functional dependencies are a very simple type of statistics, and
- as such have several limitations. The first limitation is that they
- only work with simple equality conditions, comparing columns and constant
- values. It's not possible to use them to eliminate equality conditions
- comparing two columns or a column to an expression, range clauses,
- LIKE or any other type of conditions.
-
-
- When eliminating the implied conditions, the planner assumes that the
- conditions are compatible. Consider the following example, violating
- this assumption:
-
-EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 10;
- QUERY PLAN
------------------------------------------------------------------------------------------------
- Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual time=2.992..2.992 rows=0 loops=1)
- Filter: ((a = 1) AND (b = 10))
- Rows Removed by Filter: 10000
- Planning time: 0.232 ms
- Execution time: 3.033 ms
-(5 rows)
-
-
- While there are no rows with such combination of values, the planner
- is unable to verify whether the values match - it only knows that
- the columns are functionally dependent.
-
-
- This assumption is more about queries executed on the database - in many
- cases, it's actually satisfied (e.g. when the GUI only allows selecting
- compatible values). But if that's not the case, functional dependencies
- may not be a viable option.
-
-
- For additional information about functional dependencies, see
- src/backend/statistics/README.dependencies.
-
-
-
-
-
-