Add a section about the planner's statistics, including a description

author Tom Lane

Tue, 16 Oct 2001 01:13:44 +0000 (01:13 +0000)

committer Tom Lane

Tue, 16 Oct 2001 01:13:44 +0000 (01:13 +0000)
author Tom Lane
Tue, 16 Oct 2001 01:13:44 +0000 (01:13 +0000)
committer Tom Lane
Tue, 16 Oct 2001 01:13:44 +0000 (01:13 +0000)
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml

index 61c28bd000e3f9cae02ae729e22028d524be577a..c10be1e34f68f038e8b39e298bab7a747c3cbb29 100644 (file)
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1,5 +1,5 @@
  
  
   
@@ -315,6 +315,200 @@ Total runtime: 30.67 msec
     
    
  
+ 
+  Statistics used by the Planner
+
+  
+   As we saw in the previous section, the query planner needs to estimate
+   the number of rows retrieved by a query in order to make good choices
+   of query plans.  This section provides a quick look at the statistics
+   that the system uses for these estimates.
+  
+
+  
+   One component of the statistics is the total number of entries in each
+   table and index, as well as the number of disk blocks occupied by each
+   table and index.  This information is kept in
+   pg_class's reltuples
+   and relpages columns.  We can look at it
+   with queries similar to this one:
+
+
+regression=# select relname, relkind, reltuples, relpages from pg_class
+regression-# where relname like 'tenk1%';
+    relname    | relkind | reltuples | relpages
+---------------+---------+-----------+----------
+ tenk1         | r       |     10000 |      233
+ tenk1_hundred | i       |     10000 |       30
+ tenk1_unique1 | i       |     10000 |       30
+ tenk1_unique2 | i       |     10000 |       30
+(4 rows)
+
+
+   Here we can see that tenk1 contains 10000
+   rows, as do its indexes, but the indexes are (unsurprisingly) much
+   smaller than the table.
+  
+
+  
+   For efficiency reasons, reltuples 
+   and relpages are not updated on-the-fly,
+   and so they usually contain only approximate values (which is good
+   enough for the planner's purposes).  They are initialized with dummy
+   values (presently 1000 and 10 respectively) when a table is created.
+   They are updated by certain commands, presently VACUUM,
+   ANALYZE, and CREATE INDEX.  A stand-alone
+   ANALYZE, that is one not part of VACUUM,
+   generates an approximate reltuples value
+   since it does not read every row of the table.
+  
+
+  
+   Most queries retrieve only a fraction of the rows in a table, due
+   to having WHERE clauses that restrict the rows to be examined.
+   The planner thus needs to make an estimate of the
+   selectivity of WHERE clauses, that is, the fraction of
+   rows that match each clause of the WHERE condition.  The information
+   used for this task is stored in the pg_statistic
+   system catalog.  Entries in pg_statistic are
+   updated by ANALYZE and VACUUM ANALYZE commands,
+   and are always approximate even when freshly updated.
+  
+
+  
+   Rather than look at pg_statistic directly,
+   it's better to look at its view pg_stats
+   when examining the statistics manually.  pg_stats
+   is designed to be more easily readable.  Furthermore,
+   pg_stats is readable by all, whereas
+   pg_statistic is only readable by the superuser.
+   (This prevents unprivileged users from learning something about
+   the contents of other people's tables from the statistics.  The
+   pg_stats view is restricted to show only
+   rows about tables that the current user can read.)
+   For example, we might do:
+
+
+regression=# select attname, n_distinct, most_common_vals from pg_stats where tablename = 'road';
+ attname | n_distinct |                                                                                                                                                                                  most_common_vals                                                                                                                                                                                   
+---------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ name    |  -0.467008 | {"I- 580                        Ramp","I- 880                        Ramp","Sp Railroad                       ","I- 580                            ","I- 680                        Ramp","I- 80                         Ramp","14th                          St  ","5th                           St  ","Mission                       Blvd","I- 880                            "}
+ thepath |         20 | {"[(-122.089,37.71),(-122.0886,37.711)]"}
+(2 rows)
+regression=#
+
+
+   As of Postgres 7.2 the following columns exist
+   in pg_stats:
+  
+
+  +   pg_stats Columns
+
+   
+    
+     
+      Name
+      Type
+      Description
+     
+    
+
+    
+     
+      tablename
+      name
+      Name of table containing column
+     
+
+     
+      attname
+      name
+      Column described by this row
+     
+
+     
+      null_frac
+      real
+      Fraction of column's entries that are NULL
+     
+
+     
+      avg_width
+      integer
+      Average width in bytes of column's entries
+     
+
+     
+      n_distinct
+      real
+      If greater than zero, the estimated number of distinct values
+      in the column.  If less than zero, the negative of the number of
+      distinct values divided by the number of rows.  (The negated form
+      is used when ANALYZE believes that the number of distinct values
+      is likely to increase as the table grows; the positive form is used
+      when the column seems to have a fixed number of possible values.)
+      For example, -1 indicates a unique column in which the number of
+      distinct values is the same as the number of rows.
+      
+     
+
+     
+      most_common_vals
+      text[]
+      A list of the most common values in the column. (Omitted if
+      no values seem to be more common than any others.)
+     
+
+     
+      most_common_freqs
+      real[]
+      A list of the frequencies of the most common values,
+      ie, number of occurrences of each divided by total number of rows.
+     
+     
+
+     
+      histogram_bounds
+      text[]
+      A list of values that divide the column's values into
+      groups of approximately equal population.  The 
+      most_common_vals, if present, are omitted from the
+      histogram calculation.  (Omitted if column datatype does not have a
+      < operator, or if the most_common_vals
+      list accounts for the entire population.)
+      
+     
+
+     
+      correlation
+      real
+      Statistical correlation between physical row ordering and
+      logical ordering of the column values.  This ranges from -1 to +1.
+      When the value is near -1 or +1, an indexscan on the column will
+      be estimated to be cheaper than when it is near zero, due to reduction
+      of random access to the disk.  (Omitted if column datatype does
+      not have a < operator.)
+      
+     
+    
+   
+  
+   pg_stats Columns
+
+   
+    
+     
+      Name
+      Type
+      Description
+     
+    
+
+    
+     
+      tablename
+      name
+      Name of table containing column
+     
+
+     
+      attname
+      name
+      Column described by this row
+     
+
+     
+      null_frac
+      real
+      Fraction of column's entries that are NULL
+     
+
+     
+      avg_width
+      integer
+      Average width in bytes of column's entries
+     
+
+     
+      n_distinct
+      real
+      If greater than zero, the estimated number of distinct values
+      in the column.  If less than zero, the negative of the number of
+      distinct values divided by the number of rows.  (The negated form
+      is used when ANALYZE believes that the number of distinct values
+      is likely to increase as the table grows; the positive form is used
+      when the column seems to have a fixed number of possible values.)
+      For example, -1 indicates a unique column in which the number of
+      distinct values is the same as the number of rows.
+      
+     
+
+     
+      most_common_vals
+      text[]
+      A list of the most common values in the column. (Omitted if
+      no values seem to be more common than any others.)
+     
+
+     
+      most_common_freqs
+      real[]
+      A list of the frequencies of the most common values,
+      ie, number of occurrences of each divided by total number of rows.
+     
+     
+
+     
+      histogram_bounds
+      text[]
+      A list of values that divide the column's values into
+      groups of approximately equal population.  The 
+      most_common_vals, if present, are omitted from the
+      histogram calculation.  (Omitted if column datatype does not have a
+      < operator, or if the most_common_vals
+      list accounts for the entire population.)
+      
+     
+
+     
+      correlation
+      real
+      Statistical correlation between physical row ordering and
+      logical ordering of the column values.  This ranges from -1 to +1.
+      When the value is near -1 or +1, an indexscan on the column will
+      be estimated to be cheaper than when it is near zero, due to reduction
+      of random access to the disk.  (Omitted if column datatype does
+      not have a < operator.)
+      
+     
+    
+   
+  
+
+  
+   The maximum number of entries in the most_common_vals
+   and histogram_bounds arrays can be set on a
+   column-by-column basis using the ALTER TABLE SET STATISTICS
+   command.  The default limit is presently 10 entries.  Raising the limit
+   may allow more accurate planner estimates to be made, particularly for
+   columns with irregular data distributions, at the price of consuming
+   more space in pg_statistic and slightly more
+   time to compute the estimates.  Conversely, a lower limit may be
+   appropriate for columns with simple data distributions.
+  
+
+ 
+
   
    Controlling the Planner with Explicit JOINs
  
diff --git a/doc/src/sgml/ref/analyze.sgml b/doc/src/sgml/ref/analyze.sgml

index 98f38aefb90a029b82c0d95a3af1d5614cdf7640..70e172790ba32865ca2d2e2c3746b4365b78950b 100644 (file)
--- a/doc/src/sgml/ref/analyze.sgml
+++ b/doc/src/sgml/ref/analyze.sgml
@@ -1,5 +1,5 @@
  
  
@@ -152,7 +152,9 @@ ANALYZE [ VERBOSE ] [ table [ (
     distribution in each column.  One or both of these may be omitted if
     ANALYZE deems them uninteresting (for example, in
     a unique-key column, there are no common values) or if the column
-   datatype does not support the appropriate operators.
+   datatype does not support the appropriate operators.  There is more
+   information about the statistics in the User's
+   Guide.
author	Tom Lane
	Tue, 16 Oct 2001 01:13:44 +0000 (01:13 +0000)
committer	Tom Lane
	Tue, 16 Oct 2001 01:13:44 +0000 (01:13 +0000)
doc/src/sgml/perform.sgml		patch \| blob \| blame \| history
doc/src/sgml/ref/analyze.sgml		patch \| blob \| blame \| history