--- /dev/null
+
+
+
+
+
Database Physical Storage
+
+This chapter provides an overview of the physical storage format used by
+PostgreSQL databases.
+
+
+
+
+
Database File Layout
+
+This section describes the storage format at the level of files and
+directories.
+
+
+All the data needed for a database cluster is stored within the cluster's data
+directory, commonly referred to as PGDATA (after the name of the
+environment variable that can be used to define it). A common location for
+PGDATA is /var/lib/pgsql/data. Multiple clusters,
+managed by different postmasters, can exist on the same machine.
+
+
+The PGDATA directory contains several subdirectories and control
+files, as shown in the table below. In addition to
+these required items, the cluster configuration files
+postgresql.conf, pg_hba.conf, and
+pg_ident.conf are traditionally stored in
+PGDATA (although beginning in
+PostgreSQL 8.0 it is possible to keep them
+elsewhere).
+
+
+
+
Contents of PGDATA
+
+
+| Item            | Description |
+| PG_VERSION      | A file containing the major version number of PostgreSQL |
+| base            | Subdirectory containing per-database subdirectories |
+| global          | Subdirectory containing cluster-wide tables, such as pg_database |
+| pg_clog         | Subdirectory containing transaction commit status data |
+| pg_subtrans     | Subdirectory containing subtransaction status data |
+| pg_tblspc       | Subdirectory containing symbolic links to tablespaces |
+| pg_xlog         | Subdirectory containing WAL (Write Ahead Log) files |
+| postmaster.opts | A file recording the command-line options the postmaster was last started with |
+| postmaster.pid  | A lock file recording the current postmaster PID and shared memory segment ID (not present after postmaster shutdown) |
+
+
+
+
+
+
+For each database in the cluster there is a subdirectory within
+PGDATA/base, named after the database's OID in
+pg_database. This subdirectory is the default location
+for the database's files; in particular, its system catalogs are stored
+there.
+
+
+Each table and index is stored in a separate file, named after the table
+or index's filenode number, which can be found in
+pg_class.relfilenode.
+
+
+
+Note that while a table's filenode often matches its OID, this is
+not necessarily the case; some operations, like
+TRUNCATE, REINDEX, CLUSTER and some forms
+of ALTER TABLE, can change the filenode while preserving the OID.
+Avoid assuming that filenode and table OID are the same.
+
+
+
+When a table or index exceeds 1GB, it is divided into gigabyte-sized
+segments. The first segment's file name is the same as the
+filenode; subsequent segments are named filenode.1, filenode.2, etc.
+This arrangement avoids problems on platforms that have file size limitations.
+The contents of tables and indexes are discussed further in
+the Database Page Layout section.
+
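+The segment naming rule can be sketched as follows. This is only an
+illustration: the function name and the treatment of a relation of exactly
+1GB (kept in a single file here) are assumptions, not taken from the server
+source.

```python
SEGMENT_SIZE = 1024 ** 3  # one gigabyte per segment, as described in the text


def segment_file_names(filenode: int, relation_size: int) -> list[str]:
    """Return the file names holding a relation of relation_size bytes.

    Hypothetical helper: the first segment is named after the filenode,
    later segments are filenode.1, filenode.2, and so on.
    """
    # Ceiling division; a relation always has at least its first segment.
    segment_count = max(1, -(-relation_size // SEGMENT_SIZE))
    names = [str(filenode)]
    names += [f"{filenode}.{n}" for n in range(1, segment_count)]
    return names
```

+For example, a relation with filenode 16384 that holds a little over 2GB
+would occupy the files 16384, 16384.1 and 16384.2.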
+
+A table that has columns with potentially large entries will have an
+associated TOAST table, which is used for out-of-line storage of
+field values that are too large to keep in the table rows proper.
+pg_class.reltoastrelid links from a table to
+its TOAST table, if any.
+See the TOAST section for more information.
+
+
+Tablespaces make the scenario more complicated. Each user-defined tablespace
+has a symbolic link inside the PGDATA/pg_tblspc
+directory, which points to the physical tablespace directory (as specified in
+its CREATE TABLESPACE command). The symbolic link is named after
+the tablespace's OID. Inside the physical tablespace directory there is
+a subdirectory for each database that has elements in the tablespace, named
+after the database's OID. Tables within that directory follow the filenode
+naming scheme. The pg_default tablespace is not accessed through
+pg_tblspc, but corresponds to
+PGDATA/base. Similarly, the pg_global
+tablespace is not accessed through pg_tblspc, but corresponds to
+PGDATA/global.
+
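+A minimal sketch of the directory lookup described above. The function and
+parameter names are hypothetical, and the assumption that pg_default files
+sit in a per-database subdirectory of base is taken from the earlier
+description of PGDATA/base.

```python
import posixpath
from typing import Optional


def relation_directory(pgdata: str, tablespace_name: str,
                       tablespace_oid: Optional[int],
                       database_oid: int) -> str:
    """Return the directory that holds a relation's files (illustrative only)."""
    if tablespace_name == "pg_default":
        # Not reached through pg_tblspc; corresponds to PGDATA/base.
        return posixpath.join(pgdata, "base", str(database_oid))
    if tablespace_name == "pg_global":
        # Cluster-wide tables; corresponds to PGDATA/global.
        return posixpath.join(pgdata, "global")
    # User-defined tablespace: a symbolic link named after the tablespace OID,
    # containing one subdirectory per database OID.
    return posixpath.join(pgdata, "pg_tblspc", str(tablespace_oid),
                          str(database_oid))
```

+The file name inside the returned directory is then the relation's filenode.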
+
+
+
+
+
+
TOAST
+
+
+
+
+This section provides an overview of TOAST (The
+Oversized-Attribute Storage Technique).
+
+
+Since PostgreSQL uses a fixed page size (commonly
+8kB), and does not allow tuples to span multiple pages, it's not possible to
+store very large field values directly. Before PostgreSQL 7.1
+there was a hard limit of just under one page on the total amount of data that
+could be put into a table row. In release 7.1 and later, this limit is
+overcome by allowing large field values to be compressed and/or broken up into
+multiple physical rows. This happens transparently to the user, with only
+small impact on most of the backend code. The technique is affectionately
+known as TOAST (or "the best thing since sliced bread").
+
+
+Only certain data types support TOAST, since there is no need to
+impose the overhead on data types that cannot produce large field values.
+To support TOAST, a data type must have a variable-length
+(varlena) representation, in which the first 32-bit word of any
+stored value contains the total length of the value in bytes (including
+itself). TOAST does not constrain the rest of the representation.
+All the C-level functions supporting a TOAST-able data type must
+be careful to handle TOASTed input values. (This is normally done
+by invoking PG_DETOAST_DATUM before doing anything with an input
+value; but in some cases more efficient approaches are possible.)
+
+
+
+TOAST usurps the high-order two bits of the varlena length word,
+thereby limiting the logical size of any value of a TOAST-able
+data type to 1GB (2^30 - 1 bytes). When both bits are zero,
+the value is an ordinary un-TOASTed value of the data type. One
+of these bits, if set, indicates that the value has been compressed and must
+be decompressed before use. The other bit, if set, indicates that the value
+has been stored out-of-line. In this case the remainder of the value is
+actually just a pointer, and the correct data has to be found elsewhere. When
+both bits are set, the out-of-line data has been compressed too. In each case
+the length in the low-order bits of the varlena word indicates the actual size
+of the datum, not the size of the logical value that would be extracted by
+decompression or fetching of the out-of-line data.
+
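+The use of the length word can be illustrated with a small decoder. The text
+does not say which of the two high-order bits carries which meaning, so the
+bit assignment below is an assumption made for the example, as are the names.

```python
# Assumption: the top bit marks out-of-line storage and the next bit marks
# compression. The text only states that the two high-order bits are used.
FLAG_EXTERNAL = 0x80000000
FLAG_COMPRESSED = 0x40000000
SIZE_MASK = 0x3FFFFFFF  # the remaining 30 low-order bits


def decode_length_word(word: int) -> dict:
    """Split a 32-bit varlena length word into stored size and flag bits."""
    return {
        # Actual size of the stored datum, not of the logical value.
        "stored_size": word & SIZE_MASK,
        "compressed": bool(word & FLAG_COMPRESSED),
        "external": bool(word & FLAG_EXTERNAL),
    }
```

+With both flag bits clear the value is an ordinary inline value; the mask
+shows why the logical size is limited to 2^30 - 1 bytes.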
+
+If any of the columns of a table are TOAST-able, the table will
+have an associated TOAST table, whose OID is stored in the table's
+pg_class.reltoastrelid entry. Out-of-line
+TOASTed values are kept in the TOAST table, as
+described in more detail below.
+
+
+The compression technique used is a fairly simple and very fast member
+of the LZ family of compression techniques. See
+src/backend/utils/adt/pg_lzcompress.c for the details.
+
+
+Out-of-line values are divided (after compression if used) into chunks of at
+most TOAST_MAX_CHUNK_SIZE bytes (this value is a little less than
+BLCKSZ/4, or about 2000 bytes by default). Each chunk is stored
+as a separate row in the TOAST table for the owning table. Every
+TOAST table has the columns chunk_id (an OID
+identifying the particular TOASTed value),
+chunk_seq (a sequence number for the chunk within its value),
+and chunk_data (the actual data of the chunk). A unique index
+on chunk_id and chunk_seq provides fast
+retrieval of the values. A pointer datum representing an out-of-line
+TOASTed value therefore needs to store the OID of the
+TOAST table in which to look and the OID of the specific value
+(its chunk_id). For convenience, pointer datums also store the
+logical datum size (original uncompressed data length) and actual stored size
+(different if compression was applied). Allowing for the varlena header word,
+the total size of a TOAST pointer datum is therefore 20 bytes
+regardless of the actual size of the represented value.
+
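+The chunking scheme and the 20-byte pointer size can be checked with a short
+sketch. The chunk size of 2000 is a placeholder for "a little less than
+BLCKSZ/4", numbering chunks from 0 is an assumption, and the field order in
+the pointer layout is illustrative rather than the server's actual struct.

```python
import struct

TOAST_MAX_CHUNK_SIZE = 2000  # placeholder value, see the lead-in


def split_into_chunks(chunk_id: int, data: bytes) -> list:
    """Return (chunk_id, chunk_seq, chunk_data) rows for one out-of-line value."""
    return [
        (chunk_id, seq, data[start:start + TOAST_MAX_CHUNK_SIZE])
        for seq, start in enumerate(range(0, len(data), TOAST_MAX_CHUNK_SIZE))
    ]


# Fields named in the text, each assumed to be 4 bytes wide:
# varlena header word, logical size, stored size, value OID, TOAST table OID.
POINTER_FORMAT = "=IiiII"
POINTER_SIZE = struct.calcsize(POINTER_FORMAT)
```

+Reassembling a value amounts to concatenating chunk_data in chunk_seq order
+for a given chunk_id, which is what the unique index makes cheap.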
+
+The TOAST code is triggered only
+when a row value to be stored in a table is wider than BLCKSZ/4
+bytes (normally 2kB). The TOAST code will compress and/or move
+field values out-of-line until the row value is shorter than
+BLCKSZ/4 bytes or no more gains can be had. During an UPDATE
+operation, values of unchanged fields are normally preserved as-is; so an
+UPDATE of a row with out-of-line values incurs no TOAST costs if
+none of the out-of-line values change.
+
+
+The TOAST code recognizes four different strategies for storing
+TOAST-able columns on disk:
+
+
+
+ PLAIN prevents either compression or
+ out-of-line storage. This is the only possible strategy for
+ columns of non-TOAST-able data types.
+
+
+
+ EXTENDED allows both compression and out-of-line
+ storage. This is the default for most TOAST-able data types.
+ Compression will be attempted first, then out-of-line storage if
+ the row is still too big.
+
+
+
+ EXTERNAL allows out-of-line storage but not
+ compression. Use of EXTERNAL will
+ make substring operations on wide text and
+ bytea columns faster (at the penalty of increased storage
+ space) because these operations are optimized to fetch only the
+ required parts of the out-of-line value when it is not compressed.
+
+
+
+ MAIN allows compression but not out-of-line
+ storage. (Actually, out-of-line storage will still be performed
+ for such columns, but only as a last resort when there is no other
+ way to make the row small enough.)
+
+
+
+
+Each TOAST-able data type specifies a default strategy for columns
+of that data type, but the strategy for a given table column can be altered
+with ALTER TABLE SET STORAGE.
+
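+The four strategies can be summarized as a small lookup table. This models
+only which actions each strategy permits, as listed above; it is not how the
+server decides what to do with a given row, and the names are hypothetical.

```python
STRATEGIES = {
    "PLAIN":    {"compress": False, "out_of_line": False},
    "EXTENDED": {"compress": True,  "out_of_line": True},
    "EXTERNAL": {"compress": False, "out_of_line": True},
    # MAIN moves data out-of-line only as a last resort (handled below).
    "MAIN":     {"compress": True,  "out_of_line": False},
}


def permitted_actions(strategy: str, last_resort: bool = False) -> list:
    """List the actions allowed for a column, compression first."""
    rules = STRATEGIES[strategy]
    actions = []
    if rules["compress"]:
        actions.append("compress")
    if rules["out_of_line"] or (strategy == "MAIN" and last_resort):
        actions.append("move out-of-line")
    return actions
```

+Listing compression first mirrors the EXTENDED description, where
+compression is attempted before out-of-line storage.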
+
+This scheme has a number of advantages compared to a more straightforward
+approach such as allowing row values to span pages. Assuming that queries are
+usually qualified by comparisons against relatively small key values, most of
+the work of the executor will be done using the main row entry. The big values
+of TOASTed attributes will only be pulled out (if selected at all)
+at the time the result set is sent to the client. Thus, the main table is much
+smaller and more of its rows fit in the shared buffer cache than would be the
+case without any out-of-line storage. Sort sets shrink also, and sorts will
+more often be done entirely in memory. A little test showed that a table
+containing typical HTML pages and their URLs was stored in about half of the
+raw data size including the TOAST table, and that the main table
+contained only about 10% of the entire data (the URLs and some small HTML
+pages). There was no run time difference compared to an un-TOASTed
+comparison table, in which all the HTML pages were cut down to 7kB to fit.
+
+
+
+
+
+
+
Database Page Layout
+
+This section provides an overview of the page format used within
+PostgreSQL tables and indexes.
+(Actually, index access methods need not use this page format.
+ All the existing index methods do use this basic format,
+ but the data kept on index metapages usually doesn't follow
+ the item layout rules.)
+
+
+Sequences and TOAST tables are formatted just like a regular table.
+
+
+In the following explanation, a byte
+is assumed to contain 8 bits. In addition, the term item
+refers to an individual data value that is stored on a page. In a table,
+an item is a row; in an index, an item is an index entry.
+
+
+Every table and index is stored as an array of pages of a
+fixed size (usually 8kB, although a different page size can be selected
+when compiling the server). In a table, all the pages are logically
+equivalent, so a particular item (row) can be stored in any page. In
+indexes, the first page is generally reserved as a metapage
+holding control information, and there may be different types of pages
+within the index, depending on the index access method.
+
+
+The Overall Page Layout table shows the overall layout of a page.
+There are five parts to each page.
+
+
+
+
Overall Page Layout
+
+
+| Item           | Description |
+| PageHeaderData | 20 bytes long. Contains general information about the page, including free space pointers. |
+| ItemIdData     | Array of (offset,length) pairs pointing to the actual items. 4 bytes per item. |
+| Free space     | The unallocated space. New item identifiers are allocated from the start of this area, new items from the end. |
+| Items          | The actual items themselves. |
+| Special space  | Index access method specific data. Different methods store different data. Empty in ordinary tables. |
+
+
+
+
+
+
+
+ The first 20 bytes of each page consist of a page header
+ (PageHeaderData). Its format is detailed in the
+ PageHeaderData Layout table. The first two fields track the most
+ recent WAL entry related to this page. They are followed by three 2-byte
+ integer fields
+ (pd_lower, pd_upper,
+ and pd_special). These contain byte offsets
+ from the page start to the start
+ of unallocated space, to the end of unallocated space, and to the start of
+ the special space.
+ The last 2 bytes of the page header,
+ pd_pagesize_version, store both the page size
+ and a version indicator. Beginning with
+ PostgreSQL 8.0 the version number is 2;
+ PostgreSQL 7.3 and 7.4 used version number 1;
+ prior releases used version number 0.
+ (The basic page layout and header format have not changed in these versions,
+ but the layout of heap row headers has.) The page size
+ is present only as a cross-check; there is no support for having
+ more than one page size in an installation.
+
+
+
+
+
PageHeaderData Layout
+
+
+
+| Field               | Type          | Length  | Description |
+| pd_lsn              | XLogRecPtr    | 8 bytes | LSN: next byte after last byte of xlog record for last change to this page |
+| pd_tli              | TimeLineID    | 4 bytes | TLI of last change |
+| pd_lower            | LocationIndex | 2 bytes | Offset to start of free space |
+| pd_upper            | LocationIndex | 2 bytes | Offset to end of free space |
+| pd_special          | LocationIndex | 2 bytes | Offset to start of special space |
+| pd_pagesize_version | uint16        | 2 bytes | Page size and layout version number information |
+
+
+
+
+
+ All the details may be found in
+ src/include/storage/bufpage.h.
+
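+As an illustration of the header layout, the following sketch unpacks the
+20-byte header from a raw page image. Reading pd_lsn as two 32-bit halves,
+using native byte order, and splitting pd_pagesize_version into a page size
+in the high byte and a version in the low byte are assumptions made for the
+example; bufpage.h remains the authority.

```python
import struct

# pd_lsn (8 bytes, read as two 32-bit halves), pd_tli (4 bytes),
# pd_lower, pd_upper, pd_special, pd_pagesize_version (2 bytes each).
HEADER_FORMAT = "=IIIHHHH"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def parse_page_header(page: bytes) -> dict:
    """Decode the page header fields from the start of a page image."""
    (lsn_first, lsn_second, tli,
     lower, upper, special, size_version) = struct.unpack_from(HEADER_FORMAT, page)
    return {
        "pd_lsn": (lsn_first, lsn_second),
        "pd_tli": tli,
        "pd_lower": lower,
        "pd_upper": upper,
        "pd_special": special,
        # Assumption: page sizes are multiples of 256, so the low byte is
        # free to carry the layout version number.
        "page_size": size_version & 0xFF00,
        "layout_version": size_version & 0x00FF,
    }
```

+On an ordinary table page one would expect pd_special to equal the page size,
+since such pages have no special space.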
+
+
+ Following the page header are item identifiers
+ (ItemIdData), each requiring four bytes.
+ An item identifier contains a byte-offset to
+ the start of an item, its length in bytes, and a few attribute bits
+ which affect its interpretation.
+ New item identifiers are allocated
+ as needed from the beginning of the unallocated space.
+ The number of item identifiers present can be determined by looking at
+ pd_lower, which is increased to allocate a new identifier.
+ Because an item
+ identifier is never moved until it is freed, its index may be used on a
+ long-term basis to reference an item, even when the item itself is moved
+ around on the page to compact free space. In fact, every pointer to an
+ item (ItemPointer, also known as
+ CTID) created by
+ PostgreSQL consists of a page number and the
+ index of an item identifier.
+
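+Because the identifier array starts right after the 20-byte header and grows
+by four bytes per item, pd_lower and pd_upper are enough to derive the item
+count and the free space. The helper names below are hypothetical, and
+numbering identifiers from 1 is an assumption not stated in the text.

```python
PAGE_HEADER_SIZE = 20  # size of PageHeaderData, from the text
ITEM_ID_SIZE = 4       # size of one ItemIdData entry, from the text


def item_count(pd_lower: int) -> int:
    """Number of item identifiers currently allocated on the page."""
    return (pd_lower - PAGE_HEADER_SIZE) // ITEM_ID_SIZE


def free_space(pd_lower: int, pd_upper: int) -> int:
    """Bytes of unallocated space between the identifiers and the items."""
    return pd_upper - pd_lower


def item_id_offset(item_index: int) -> int:
    """Byte offset in the page of the identifier with a 1-based index."""
    return PAGE_HEADER_SIZE + (item_index - 1) * ITEM_ID_SIZE
```

+Allocating one more identifier raises pd_lower by four, which is why the
+index of an identifier stays valid even when the item it points to is moved.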
+
+
+
+ The items themselves are stored in space allocated backwards from the end
+ of unallocated space. The exact structure varies depending on what the
+ table is to contain. Tables and sequences both use a structure named
+ HeapTupleHeaderData, described below.
+
+
+
+
+ The final section is the "special section", which may
+ contain anything the access method wishes to store. For example,
+ b-tree indexes store links to the page's left and right siblings,
+ as well as some other data relevant to the index structure.
+ Ordinary tables do not use a special section at all (indicated by setting
+ pd_special to equal the page size).
+
+
+
+
+ All table rows are structured in the same way. There is a fixed-size
+ header (occupying 27 bytes on most machines), followed by an optional null
+ bitmap, an optional object ID field, and the user data. The header is
+ detailed in the HeapTupleHeaderData Layout table. The actual user data
+ (columns of the row) begins at the offset indicated by
+ t_hoff, which must always be a multiple of the MAXALIGN
+ distance for the platform.
+ The null bitmap is
+ only present if the HEAP_HASNULL bit is set in
+ t_infomask. If it is present it begins just after
+ the fixed header and occupies enough bytes to have one bit per data column
+ (that is, t_natts bits altogether). In this list of bits, a
+ 1 bit indicates not-null, a 0 bit is a null. When the bitmap is not
+ present, all columns are assumed not-null.
+ The object ID is only present if the HEAP_HASOID bit
+ is set in t_infomask. If present, it appears just
+ before the t_hoff boundary. Any padding needed to make
+ t_hoff a MAXALIGN multiple will appear between the null
+ bitmap and the object ID. (This in turn ensures that the object ID is
+ suitably aligned.)
+
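+The relationship between the header, the null bitmap, the object ID and
+t_hoff can be sketched as below. The 27-byte header and the rounding to a
+MAXALIGN multiple come from the text; the 4-byte object ID size, the default
+alignment of 4, and the bit order within each bitmap byte (lowest bit first)
+are assumptions made for illustration.

```python
FIXED_HEADER_SIZE = 27  # from the text, "on most machines"
OID_SIZE = 4            # assumed size of the object ID field


def max_align(length: int, alignment: int) -> int:
    """Round length up to the next multiple of alignment."""
    return (length + alignment - 1) // alignment * alignment


def compute_t_hoff(natts: int, has_nulls: bool, has_oid: bool,
                   alignment: int = 4) -> int:
    """Offset at which the user data begins."""
    length = FIXED_HEADER_SIZE
    if has_nulls:
        length += (natts + 7) // 8  # one bit per column, in whole bytes
    if has_oid:
        length += OID_SIZE          # sits just before the t_hoff boundary
    return max_align(length, alignment)


def is_null(bitmap: bytes, column_index: int) -> bool:
    """A 0 bit means null; bit order within a byte is assumed."""
    bit = (bitmap[column_index // 8] >> (column_index % 8)) & 1
    return bit == 0
```

+When an object ID is present it would start at t_hoff minus its size, with
+any padding falling between the null bitmap and the object ID.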
+
+
+
+
HeapTupleHeaderData Layout
+
+
+
+| Field      | Type            | Length  | Description |
+| t_xmin     | TransactionId   | 4 bytes | insert XID stamp |
+| t_cmin     | CommandId       | 4 bytes | insert CID stamp |
+| t_xmax     | TransactionId   | 4 bytes | delete XID stamp |
+| t_cmax     | CommandId       | 4 bytes | delete CID stamp (overlays with t_xvac) |
+| t_xvac     | TransactionId   | 4 bytes | XID for VACUUM operation moving a row version |
+| t_ctid     | ItemPointerData | 6 bytes | current TID of this or newer row version |
+| t_natts    | int16           | 2 bytes | number of attributes |
+| t_infomask | uint16          | 2 bytes | various flag bits |
+| t_hoff     | uint8           | 1 byte  | offset to user data |
+
+
+
+
+
+ All the details may be found in
+ src/include/access/htup.h.
+
+
+
+ Interpreting the actual data can only be done with information obtained
+ from other tables, mostly pg_attribute. The
+ key values needed to identify field locations are
+ attlen and attalign.
+ There is no way to directly get a
+ particular attribute, except when there are only fixed width fields and no
+ NULLs. All this trickery is wrapped up in the functions
+ heap_getattr, fastgetattr
+ and heap_getsysattr.
+
+
+
+ To read the data you need to examine each attribute in turn. First check
+ whether the field is NULL according to the null bitmap. If it is, go to
+ the next. Then make sure you have the right alignment. If the field is a
+ fixed width field, then its bytes are simply stored in place. If it's a
+ variable length field (attlen = -1) then it's a bit more complicated.
+ All variable-length data types share the common header structure
+ varattrib, which includes the total length of the stored
+ value and some flag bits. Depending on the flags, the data may be either
+ inline or in a TOAST table;
+ it might be compressed, too (see the TOAST section).
+
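+The attribute walk described above might look like the following sketch. It
+is not the logic of heap_getattr: the alignment is passed in as a byte count
+rather than as the attalign code stored in pg_attribute, the size of a
+variable-length value is supplied by a caller-provided function, and all
+names are hypothetical.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def locate_fields(attributes, null_flags, read_varlena_size):
    """Return each field's offset within the user data, or None if null.

    attributes:        list of (attlen, alignment_in_bytes) per column
    null_flags:        list of booleans taken from the null bitmap
    read_varlena_size: callable(offset) -> total stored size in bytes,
                       including the length word itself
    """
    offsets = []
    position = 0
    for (attlen, alignment), null in zip(attributes, null_flags):
        if null:
            offsets.append(None)  # nulls occupy no space in the row
            continue
        position = align_up(position, alignment)
        offsets.append(position)
        if attlen == -1:
            position += read_varlena_size(position)  # variable length
        else:
            position += attlen                       # fixed width
    return offsets
```

+The sketch also shows why a field cannot be located directly in general: its
+offset depends on the nulls and variable-length values that precede it.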
+
+
+
+
+
+ and user-defined types
+
If the values of your data type might exceed a few hundred bytes in
size (in internal form), you should make the data type
- user-defined types To do this, the internal
+ To do this, the internal
representation must follow the standard layout for variable-length
data: the first four bytes must be an int32 containing
the total length in bytes of the datum (including itself). The C