+
+
Statistics used by the Planner
+
+ As we saw in the previous section, the query planner needs to estimate
+ the number of rows retrieved by a query in order to make good choices
+ of query plans. This section provides a quick look at the statistics
+ that the system uses for these estimates.
+
+
+ One component of the statistics is the total number of entries in each
+ table and index, as well as the number of disk blocks occupied by each
+ table and index. This information is kept in
+ pg_class's reltuples
+ and relpages columns. We can look at it
+ with queries similar to this one:
+
+
+regression=# select relname, relkind, reltuples, relpages from pg_class
+regression-# where relname like 'tenk1%';
+ relname | relkind | reltuples | relpages
+---------------+---------+-----------+----------
+ tenk1 | r | 10000 | 233
+ tenk1_hundred | i | 10000 | 30
+ tenk1_unique1 | i | 10000 | 30
+ tenk1_unique2 | i | 10000 | 30
+(4 rows)
+
+
+ Here we can see that tenk1 contains 10000
+ rows, as do its indexes, but the indexes are (unsurprisingly) much
+ smaller than the table.
+
+
+ For efficiency reasons, reltuples
+ and relpages are not updated on-the-fly,
+ and so they usually contain only approximate values (which is good
+ enough for the planner's purposes). They are initialized with dummy
+ values (presently 1000 and 10 respectively) when a table is created.
+ They are updated by certain commands, presently VACUUM>,
+ ANALYZE>, and CREATE INDEX>. A stand-alone
+ ANALYZE>, that is one not part of VACUUM>,
+ generates an approximate reltuples value
+ since it does not read every row of the table.
+
+
+ Most queries retrieve only a fraction of the rows in a table, due
+ to having WHERE clauses that restrict the rows to be examined.
+ The planner thus needs to make an estimate of the
+ selectivity> of WHERE clauses, that is, the fraction of
+ rows that match each clause of the WHERE condition. The information
+ used for this task is stored in the pg_statistic
+ system catalog. Entries in pg_statistic are
+ updated by ANALYZE> and VACUUM ANALYZE> commands,
+ and are always approximate even when freshly updated.
+
+
+ Rather than look at pg_statistic directly,
+ it's better to look at its view pg_stats
+ when examining the statistics manually. pg_stats
+ is designed to be more easily readable. Furthermore,
+ pg_stats is readable by all, whereas
+ pg_statistic is only readable by the superuser.
+ (This prevents unprivileged users from learning something about
+ the contents of other people's tables from the statistics. The
+ pg_stats view is restricted to show only
+ rows about tables that the current user can read.)
+ For example, we might do:
+
+
+regression=# select attname, n_distinct, most_common_vals from pg_stats where tablename = 'road';
+ attname | n_distinct | most_common_vals
+---------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ name | -0.467008 | {"I- 580 Ramp","I- 880 Ramp","Sp Railroad ","I- 580 ","I- 680 Ramp","I- 80 Ramp","14th St ","5th St ","Mission Blvd","I- 880 "}
+ thepath | 20 | {"[(-122.089,37.71),(-122.0886,37.711)]"}
+(2 rows)
+regression=#
+
+
+ As of
Postgres 7.2 the following columns exist
+ in pg_stats:
+
+
+
+
pg_stats Columns
+
+
+
+ |
+ Name
+ Type
+ Description
+
+
+
+
+ |
+ tablename
+ name
+ Name of table containing column
+
+
+ |
+ attname
+ name
+ Column described by this row
+
+
+ |
+ null_frac
+ real
+ Fraction of column's entries that are NULL
+
+
+ |
+ avg_width
+ integer
+ Average width in bytes of column's entries
+
+
+ |
+ n_distinct
+ real
+ If greater than zero, the estimated number of distinct values
+ in the column. If less than zero, the negative of the number of
+ distinct values divided by the number of rows. (The negated form
+ is used when ANALYZE believes that the number of distinct values
+ is likely to increase as the table grows; the positive form is used
+ when the column seems to have a fixed number of possible values.)
+ For example, -1 indicates a unique column in which the number of
+ distinct values is the same as the number of rows.
+
+
+
+ |
+ most_common_vals
+ text[]
+ A list of the most common values in the column. (Omitted if
+ no values seem to be more common than any others.)
+
+
+ |
+ most_common_freqs
+ real[]
+ A list of the frequencies of the most common values,
+ ie, number of occurrences of each divided by total number of rows.
+
+
+
+ |
+ histogram_bounds
+ text[]
+ A list of values that divide the column's values into
+ groups of approximately equal population. The
+ most_common_vals>, if present, are omitted from the
+ histogram calculation. (Omitted if column datatype does not have a
+ <> operator, or if the most_common_vals>
+ list accounts for the entire population.)
+
+
+
+ |
+ correlation
+ real
+ Statistical correlation between physical row ordering and
+ logical ordering of the column values. This ranges from -1 to +1.
+ When the value is near -1 or +1, an indexscan on the column will
+ be estimated to be cheaper than when it is near zero, due to reduction
+ of random access to the disk. (Omitted if column datatype does
+ not have a <> operator.)
+
+
+
+
+
+
+ The maximum number of entries in the most_common_vals>
+ and histogram_bounds> arrays can be set on a
+ column-by-column basis using the ALTER TABLE SET STATISTICS>
+ command. The default limit is presently 10 entries. Raising the limit
+ may allow more accurate planner estimates to be made, particularly for
+ columns with irregular data distributions, at the price of consuming
+ more space in pg_statistic and slightly more
+ time to compute the estimates. Conversely, a lower limit may be
+ appropriate for columns with simple data distributions.
+
+
+
+
Controlling the Planner with Explicit JOINs