From 449e14a5618432f01066c33055229b96666bd925 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 21 May 2020 13:36:45 -0700
Subject: [PATCH] Doc: Describe CREATE INDEX deduplication strategy.

The B-Tree index deduplication strategy used during CREATE INDEX and
REINDEX differs from the lazy strategy used by retail inserts.  Make
that clear by adding a new paragraph to the B-Tree implementation
section of the documentation.

In passing, do some copy-editing of nearby deduplication documentation.
---
 doc/src/sgml/btree.sgml | 54 ++++++++++++++++++++++++++++-------------
 1 file changed, 37 insertions(+), 17 deletions(-)
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index e9cab4a55dc..73947db55cb 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -622,12 +622,13 @@ equalimage(<replaceable>opcintype</replaceable> <type>oid</type>) returns bool
   </para>
   <note>
    <para>
-    While NULL is generally not considered to be equal to any other
-    value, including NULL, NULL is nevertheless treated as just
-    another value from the domain of indexed values by the B-Tree
-    implementation (except when enforcing uniqueness in a unique
-    index).  B-Tree deduplication is therefore just as effective with
-    <quote>duplicates</quote> that contain a NULL value.
+    B-Tree deduplication is just as effective with
+    <quote>duplicates</quote> that contain a NULL value, even though
+    NULL values are never equal to each other according to the
+    <literal>=</literal> member of any B-Tree operator class.  As far
+    as any part of the implementation that understands the on-disk
+    B-Tree structure is concerned, NULL is just another value from the
+    domain of indexed values.
    </para>
   </note>
   <para>
@@ -642,6 +643,20 @@ equalimage(<replaceable>opcintype</replaceable> <type>oid</type>) returns bool
    see a moderate performance benefit from using deduplication.
    Deduplication is enabled by default.
   </para>
+  <para>
+   <command>CREATE INDEX</command> and <command>REINDEX</command>
+   apply deduplication to create posting list tuples, though the
+   strategy they use is slightly different.  Each group of duplicate
+   ordinary tuples encountered in the sorted input taken from the
+   table is merged into a posting list tuple
+   <emphasis>before</emphasis> being added to the current pending leaf
+   page.  Individual posting list tuples are packed with as many
+   <acronym>TID</acronym>s as possible.  Leaf pages are written out in
+   the usual way, without any separate deduplication pass.  This
+   strategy is well-suited to <command>CREATE INDEX</command> and
+   <command>REINDEX</command> because they are once-off batch
+   operations.
+  </para>
   <para>
    Write-heavy workloads that don't benefit from deduplication due to
    having few or no duplicate values in indexes will incur a small,
@@ -657,17 +672,22 @@ equalimage(<replaceable>opcintype</replaceable> <type>oid</type>) returns bool
    B-Tree indexes are not directly aware that under MVCC, there might
    be multiple extant versions of the same logical table row; to an
    index, each tuple is an independent object that needs its own index
-   entry.  Thus, an update of a row always creates all-new index
-   entries for the row, even if the key values did not change.  Some
-   workloads suffer from index bloat caused by these
-   implementation-level version duplicates (this is typically a
-   problem for <command>UPDATE</command>-heavy workloads that cannot
-   apply the <acronym>HOT</acronym> optimization due to modifying at
-   least one indexed column).  B-Tree deduplication does not
-   distinguish between these implementation-level version duplicates
-   and conventional duplicates.  Deduplication can nevertheless help
-   with controlling index bloat caused by implementation-level version
-   churn.
+   entry.  <quote>Version duplicates</quote> may sometimes accumulate
+   and adversely affect query latency and throughput.  This typically
+   occurs with <command>UPDATE</command>-heavy workloads where most
+   individual updates cannot apply the <acronym>HOT</acronym>
+   optimization (often because at least one indexed column gets
+   modified, necessitating a new set of index tuple versions &mdash;
+   one new tuple for <emphasis>each and every</emphasis> index).  In
+   effect, B-Tree deduplication ameliorates index bloat caused by
+   version churn.  Note that even the tuples from a unique index are
+   not necessarily <emphasis>physically</emphasis> unique when stored
+   on disk due to version churn.  The deduplication optimization is
+   selectively applied within unique indexes.  It targets those pages
+   that appear to have version duplicates.  The high level goal is to
+   give <command>VACUUM</command> more time to run before an
+   <quote>unnecessary</quote> page split caused by version churn can
+   take place.
   </para>
   <tip>
    <para>
-- 
2.39.5