|
pg_multixact
Subdirectory containing multitransaction status data
- (used for shared row locks)
+ (used for shared row locks)
|
Each table and index is stored in a separate file, named after the table
or index's filenode number, which can be found in
pg_class.relfilenode. In addition to the
-main file (aka. main fork), a free space map (see
-) that stores information about free space
-available in the relation, is stored in a file named after the filenode
-number, with the _fsm suffix. Tables also have a visibility map
-fork, with the _vm suffix, to track which pages are known to have
-no dead tuples and therefore need no vacuuming.
+main file (a/k/a main fork), each table and index has a free space
+map (see ), which stores information about free
+space available in the relation. The free space map is stored in a file named
+with the filenode number plus the suffix _fsm. Tables also have a
+visibility map fork, with the suffix _vm, to track which pages are
+known to have no dead tuples and therefore need no vacuuming.
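
The naming rule in this hunk can be illustrated with a short Python sketch. It is not PostgreSQL code: the helper name `fork_file_names` and its `is_table` flag are hypothetical, and the sketch only builds base file names from a filenode number and the `_fsm` and `_vm` suffixes described above.

```python
def fork_file_names(filenode: int, is_table: bool = True) -> dict:
    """Return the base file name of each fork of a relation.

    Hypothetical helper for illustration only. It encodes the naming
    rule from the text: the main fork is named after the filenode,
    the free space map adds "_fsm", and tables also get "_vm".
    It ignores special cases (for example, hash indexes have no FSM).
    """
    names = {
        "main": str(filenode),
        "fsm": f"{filenode}_fsm",
    }
    if is_table:
        # Only tables have a visibility map fork.
        names["vm"] = f"{filenode}_vm"
    return names
```

For example, `fork_file_names(12345)` yields `12345`, `12345_fsm`, and `12345_vm`, all of which live in the same directory as the main relation file.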
(Actually, 1 GB is just the default segment size. The segment size can be
adjusted using the configuration option
when building
PostgreSQL.)
+In principle, free space map and visibility map forks could require multiple
+segments as well, though this is unlikely to happen in practice.
The contents of tables and indexes are discussed further in
.
The name of a temporary file has the form
pgsql_tmpPPP.NNN,
where PPP is the PID of the owning backend and
-NNN distinguishes different files of that backend.
+NNN distinguishes different temporary files of that backend.
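
As a small illustration of the temporary file naming pattern, the following Python sketch splits a name of the form pgsql_tmpPPP.NNN into its two numeric parts. The function name `parse_temp_file_name` is hypothetical, and the sketch assumes that both PPP and NNN are plain decimal numbers.

```python
import re

# Assumed shape of the name: "pgsql_tmp" + PID + "." + per-backend file number.
TEMP_FILE_RE = re.compile(r"pgsql_tmp(\d+)\.(\d+)")


def parse_temp_file_name(name: str):
    """Return (backend_pid, file_number), or None if the name does not match."""
    m = TEMP_FILE_RE.fullmatch(name)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))
```

A name such as `pgsql_tmp4321.7` would be read as file number 7 of the backend with PID 4321.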
PostgreSQL uses a fixed page size (commonly
8 kB), and does not allow tuples to span multiple pages. Therefore, it is
-not possible to store very large field values directly. To overcome
+not possible to store very large field values directly. To overcome
this limitation, large field values are compressed and/or broken up into
multiple physical rows. This happens transparently to the user, with only
-small impact on most of the backend code. The technique is affectionately
+small impact on most of the backend code. The technique is affectionately
known as
TOAST (or "the best thing since sliced bread").
Free Space Map
-
-
+
+
-A Free Space Map is stored with every heap and index relation, except for
-hash indexes, to keep track of available space in the relation. It's stored
-along the main relation data, in a separate FSM relation fork, named after
-relfilenode of the relation, but with a _fsm suffix. For example,
-if the relfilenode of a relation is 12345, the FSM is stored in a file called
+Each heap and index relation, except for hash indexes, has a Free Space Map
+(FSM) to keep track of available space in the relation. It's stored
+alongside the main relation data in a separate relation fork, named after the
+filenode number of the relation, plus a _fsm suffix. For example,
+if the filenode of a relation is 12345, the FSM is stored in a file called
12345_fsm, in the same directory as the main relation file.
The Free Space Map is organized as a tree of FSM pages. The
-bottom level FSM pages stores the free space available on every
-heap (or index) page, using one byte to represent each heap page. The upper
+bottom level FSM pages store the free space available on each
+heap (or index) page, using one byte to represent each such page. The upper
levels aggregate information from the lower levels.
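
The idea of one byte per page with aggregating upper levels can be sketched as a toy model in Python. This is not the on-disk layout. The sketch assumes an 8 kB page, 256 free-space categories (so one category step is 32 bytes), and upper levels that store the maximum of their two children. All function names are hypothetical. The README referenced below describes the real structure.

```python
BLCKSZ = 8192             # assumed page size in bytes
CAT_STEP = BLCKSZ // 256  # assumed granularity: 256 categories per page size


def to_category(free_bytes: int) -> int:
    """Compress a free-space amount into one byte, rounding down (simplified)."""
    return min(free_bytes // CAT_STEP, 255)


def build_tree(leaves):
    """Build levels bottom-up; each parent holds the max of its children."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        levels.append([max(cur[i:i + 2]) for i in range(0, len(cur), 2)])
    return levels


def find_page(levels, needed_cat):
    """Walk from the root to a leaf with at least needed_cat, or return None."""
    if not levels[0] or levels[-1][0] < needed_cat:
        return None
    idx = 0
    for level in reversed(levels[:-1]):
        left = 2 * idx
        # If the left child is too small, the right child must satisfy the
        # request, because the parent stores the maximum of both.
        idx = left if level[left] >= needed_cat else left + 1
    return idx
```

Because the root holds the largest category in the whole map, a search can decide from the root alone whether any page has enough room, and otherwise follow a single path down to a suitable page.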
See src/backend/storage/freespace/README for more details on
how the FSM is structured, and how it's updated and searched.
- contrib module can be used to view the
-information stored in free space maps.
+The contrib/pg_freespacemap module can be used to examine the
+information stored in free space maps (see ).
and pd_special). These contain byte offsets
from the page start to the start
of unallocated space, to the end of unallocated space, and to the start of
- the special space.
+ the special space.
The next 2 bytes of the page header,
pd_pagesize_version, store both the page size
and a version indicator. Beginning with
more than one page size in an installation.
The last field is a hint that shows whether pruning the page is likely
to be profitable: it tracks the oldest un-pruned XMAX on the page.
-
+
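
The offset fields and the combined size/version field can be illustrated with a small Python sketch. It works on already-extracted integer values rather than raw page bytes. The names pd_lower and pd_upper for the first two offsets, and the split of pd_pagesize_version into a high byte (page size) and a low byte (version), are assumptions stated here and not taken from the hunk itself. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageOffsets:
    pd_lower: int             # assumed name: start of unallocated space
    pd_upper: int             # assumed name: end of unallocated space
    pd_special: int           # start of the special space
    pd_pagesize_version: int  # page size and version packed in 2 bytes


def page_size(h: PageOffsets) -> int:
    # Assumption: the page size is kept in the high-order byte.
    return h.pd_pagesize_version & 0xFF00


def layout_version(h: PageOffsets) -> int:
    # Assumption: the version indicator is kept in the low-order byte.
    return h.pd_pagesize_version & 0x00FF


def unallocated_bytes(h: PageOffsets) -> int:
    return h.pd_upper - h.pd_lower


def special_bytes(h: PageOffsets) -> int:
    # Zero when pd_special equals the page size (no special section).
    return page_size(h) - h.pd_special
```

With the illustrative values `PageOffsets(40, 8000, 8192, 0x2004)`, the page size is 8192, the version indicator is 4, and the special section is empty, as for an ordinary table page.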
-
+
PageHeaderData Layout
-
+
-
+ |
Field
Type
Length
-
+
The items themselves are stored in space allocated backwards from the end
of unallocated space. The exact structure varies depending on what the
table is to contain. Tables and sequences both use a structure named
HeapTupleHeaderData, described below.
-
+
-
+
The final section is the "special section", which can
contain anything the access method wishes to store. For example,
b-tree indexes store links to the page's left and right siblings,
as well as some other data relevant to the index structure.
Ordinary tables do not use a special section at all (indicated by setting
pd_special to equal the page size).
-
+
-
+
All table rows are structured in the same way. There is a fixed-size
t_hoff a MAXALIGN multiple will appear between the null
bitmap and the object ID. (This in turn ensures that the object ID is
suitably aligned.)
-
+
-
+
HeapTupleHeaderData Layout
-
+
-
+ |
Field
Type
Length
-
+
Interpreting the actual data can only be done with information obtained
from other tables, mostly pg_attribute. The
key values needed to identify field locations are
null values. All this trickery is wrapped up in the functions
heap_getattr, fastgetattr
and heap_getsysattr.
-
+
value and some flag bits. Depending on the flags, the data can be either
inline or in a TOAST table;
it might be compressed, too (see ).
-
+