-
-
Architecture
-
-
-
Postgres Architectural Concepts
-
- Before we continue, you should understand the basic
-
Postgres system architecture. Understanding how the
- parts of
Postgres interact will make the next chapter
- somewhat clearer.
- In database jargon,
Postgres uses a simple "process
- per-user" client/server model. A
Postgres session
- consists of the following cooperating UNIX processes (programs):
-
-
-
- A supervisory daemon process (
postmaster ),
-
-
-
- the user's frontend application (e.g., the
psql program), and
-
-
-
- the one or more backend database servers (the
postgres process itself).
-
-
-
-
-
- A single
postmaster manages a given collection of
- databases on a single host. Such a collection of
- databases is called an installation or site. Frontend
- applications that wish to access a given database
- within an installation make calls to the library.
- The library sends user requests over the network to the
-
postmaster (
(a)), which in turn starts a new
- backend server process ((b))
-
-
-
How a connection is established
-![]()
-
-
- and connects the
- frontend process to the new server ((c)). From
- that point on, the frontend process and the backend
- server communicate without intervention by the
-
postmaster . Hence, the
postmaster is always running, waiting
- for requests, whereas frontend and backend processes
- come and go. The libpq library allows a single
- frontend to make multiple connections to backend processes.
- However, the frontend application is still a
- single-threaded process. Multithreaded frontend/backend
- connections are not currently supported in libpq .
- One implication of this architecture is that the
-
postmaster and the backend always run on the same
- machine (the database server), while the frontend
- application may run anywhere. You should keep this
- in mind,
- because the files that can be accessed on a client
- machine may not be accessible (or may only be accessed
- using a different filename) on the database server
- machine.
- You should also be aware that the
postmaster and
- postgres servers run with the user-id of the
Postgres
- "superuser." Note that the
Postgres superuser does not
- have to be a special user (e.g., a user named
- "postgres"). Furthermore, the
Postgres superuser
- should
- definitely not be the UNIX superuser, "root"! In any
- case, all files relating to a database should belong to
- this
Postgres superuser.
-
-
-
+
+
PostgreSQL from the Developer's Point of View
+
+ This chapter gives an overview of the internal structure of the
+ After having read the following sections you
+ should have an idea of how a query is processed. Don't expect a
+ detailed description here (I think such a description dealing with
+ all data structures and functions used within
Postgres
+ would exceed 1000
+ pages!). This chapter is intended to help understanding the general
+ control and data flow within the backend from receiving a query to
+ sending the results.
+
+
+
+
The Path of a Query
+
+ Here we give a short overview of the stages a query has to pass in
+ order to obtain a result.
+
+
+
+ A connection from an application program to the
Postgres
+ server has to be established. The application program transmits a
+ query to the server and receives the results sent back by the server.
+
+
+
+
+ The parser stage checks the query
+ transmitted by the application
+ program (client) for correct syntax and creates
+ a query tree .
+
+
+
+
+ The rewrite system takes
+ the query tree created by the parser stage and looks for
+ any rules (stored in the
+ system catalogs ) to apply to
+ the querytree and performs the
+ transformations given in the rule bodies .
+ One application of the rewrite system is given in the realization of
+ views .
+
+
+ Whenever a query against a view
+ (i.e. a virtual table ) is made,
+ the rewrite system rewrites the user's query to
+ a query that accesses the base tables given in
+ the view definition instead.
+
+
+
+
+ The planner/optimizer takes
+ the (rewritten) querytree and creates a
+ queryplan that will be the input to the
+ executor .
+
+
+ It does so by first creating all possible paths
+ leading to the same result. For example if there is an index on a
+ relation to be scanned, there are two paths for the
+ scan. One possibility is a simple sequential scan and the other
+ possibility is to use the index. Next the cost for the execution of
+ each plan is estimated and the
+ cheapest plan is chosen and handed back.
+
+
+
+
+ The executor recursively steps through
+ the plan tree and
+ retrieves tuples in the way represented by the plan.
+ The executor makes use of the
+ storage system while scanning
+ relations, performs sorts and joins ,
+ evaluates qualifications and finally hands back the tuples derived.
+
+
+
+
+ In the following sections we will cover every of the above listed items
+ in more detail to give a better understanding on
Postgres 's internal
+ control and data structures.
+
+
+
+
+
How Connections are Established
+
+
Postgres is implemented using a simple "process per-user"
+ client/server model. In this model there is one client process
+ connected to exactly one server process .
+ As we don't know per se
+ how many connections will be made, we have to use a master process
+ that spawns a new server process every time a connection is
+ requested. This master process is called postmaster and
+ listens at a specified TCP/IP port for incoming connections. Whenever
+ a request for a connection is detected the postmaster process
+ spawns a new server process called postgres . The server
+ tasks (postgres processes) communicate with each other using
+ semaphores and shared memory
+ to ensure data integrity
+ throughout concurrent data access. Figure
+ \ref{connection} illustrates the interaction of the master process
+ postmaster the server process postgres and a client
+ application.
+
+
+ The client process can either be the
psql frontend (for
+ interactive SQL queries) or any user application implemented using
+ the libpg library. Note that applications implemented using
+ (the
Postgres embedded SQL preprocessor for C)
+ also use this library.
+
+
+ Once a connection is established the client process can send a query
+ to the backend (server). The query is transmitted using plain text,
+ i.e. there is no parsing done in the frontend (client). The
+ server parses the query, creates an execution plan ,
+ executes the plan and returns the retrieved tuples to the client
+ by transmitting them over the established connection.
+
+
+
+
+
+
+
+
The Parser Stage
+
+ The parser stage consists of two parts:
+
+
+
+ The parser defined in
+ gram.y and scan.l is
+ built using the UNIX tools
yacc
+
+
+
+ The transformation process does
+ modifications and augmentations to the data structures returned by the parser.
+
+
+
+
+
+
+
Parser
+
+ The parser has to check the query string (which arrives as
+ plain ASCII text) for valid syntax. If the syntax is correct a
+ parse tree is built up and handed back otherwise an error is
+ returned. For the implementation the well known UNIX
+ are used.
+
+
+ The lexer is defined in the file
+ scan.l and is responsible
+ for recognizing identifiers ,
+ the SQL keywords etc. For
+ every keyword or identifier that is found, a token
+ is generated and handed to the parser.
+
+
+ The parser is defined in the file gram.y and consists of a
+ set of grammar rules and actions
+ that are executed
+ whenever a rule is fired. The code of the actions (which
+ is actually C-code) is used to build up the parse tree.
+
+
+ The file scan.l is transformed to
+ the C-source file scan.c
+ and gram.y is transformed to
+ After these transformations have taken
+ place a normal C-compiler can be used to create the
+ parser. Never make any changes to the generated C-files as they will
+ be overwritten the next time
lex
+
+
+ The mentioned transformations and compilations are normally done
+ automatically using the makefiles
+ shipped with the
Postgres
+ source distribution.
+
+
+
+
+ A detailed description of
yacc or
+ the grammar rules given
+ in gram.y would be beyond the scope of this paper. There are
+ many books and documents dealing with
lex
+ should be familiar with
yacc before you start to study the
+ grammar given in gram.y otherwise you won't understand what
+ happens there.
+
+
+ For a better understanding of the data structures used in
Postgres
+ for the processing of a query we use an example to illustrate the
+ changes made to these data structures in every stage.
+
+
+
+
A Simple Select
+
+ This example contains the following simple query that will be used in
+ various descriptions and figures throughout the following
+ sections. The query assumes that the tables given in
+ The Supplier Database
+
+ have already been defined.
+
+ select s.sname, se.pno
+ from supplier s, sells se
+ where s.sno > 2 and
+ s.sno = se.sno;
+
+
+
+
+ Figure \ref{parsetree} shows the parse tree built by the
+ grammar rules and actions given in gram.y for the query
+ given in
+ (without the operator tree for
+ the where clause which is shown in figure \ref{where_clause}
+ because there was not enough space to show both data structures in one
+ figure).
+
+
+ The top node of the tree is a SelectStmt node. For every entry
+ appearing in the from clause of the SQL query a RangeVar
+ node is created holding the name of the alias and a pointer to a
+ RelExpr node holding the name of the relation . All
+ RangeVar nodes are collected in a list which is attached to the field
+ fromClause of the SelectStmt node.
+
+
+ For every entry appearing in the select list of the SQL query a
+ ResTarget node is created holding a pointer to an Attr
+ node. The Attr node holds the relation name of the entry and
+ a pointer to a Value node holding the name of the
+ attribute .
+ All ResTarget nodes are collected to a list which is
+ connected to the field targetList of the SelectStmt node.
+
+
+ Figure \ref{where_clause} shows the operator tree built for the
+ where clause of the SQL query given in example
+
+ which is attached to the field
+ qual of the SelectStmt node. The top node of the
+ operator tree is an A_Expr node representing an AND
+ operation. This node has two successors called lexpr and
+ rexpr pointing to two subtrees . The subtree attached to
+ lexpr represents the qualification s.sno > 2 and the one
+ attached to rexpr represents s.sno = se.sno . For every
+ attribute an Attr node is created holding the name of the
+ relation and a pointer to a Value node holding the name of the
+ attribute. For the constant term appearing in the query a
+ Const node is created holding the value.
+
+
+
+
+
+
+
+
Transformation Process
+
+ The transformation process takes the tree handed back by
+ the parser as input and steps recursively through it. If
+ a SelectStmt node is found, it is transformed
+ to a Query
+ node which will be the top most node of the new data structure. Figure
+ \ref{transformed} shows the transformed data structure (the part
+ for the transformed where clause is given in figure
+ \ref{transformed_where} because there was not enough space to show all
+ parts in one figure).
+
+
+ Now a check is made, if the relation names in the
+ FROM clause are known to the system. For every relation name
+ that is present in the
system catalogs a
RTE node is
+ created containing the relation name, the alias name and
+ the relation id . From now on the relation ids are used to
+ refer to the
relations given in the query. All
RTE nodes
+ are collected in the range table entry list which is connected
+ to the field rtable of the Query node. If a name of a
+ relation that is not known to the system is detected in the query an
+ error will be returned and the query processing will be aborted.
+
+
+ Next it is checked if the attribute names used are
+ contained in the relations given in the query. For every
+ attribute} that is found a
TLE node is created holding a pointer
+ to a Resdom node (which holds the name of the column) and a
+ pointer to a VAR node. There are two important numbers in the
+ VAR node. The field varno gives the position of the
+ relation containing the current attribute} in the range
+ table entry list created above. The field varattno gives the
+ position of the attribute within the relation. If the name
+ of an attribute cannot be found an error will be returned and
+ the query processing will be aborted.
+
+
+
+
+
+
+
+
The Postgres Rule System
+
+
Postgres supports a powerful
+ rule system for the specification
+ of views and ambiguous view updates .
+ Originally the
Postgres
+ rule system consisted of two implementations:
+
+
+
+ The first one worked using tuple level processing and was
+ implemented deep in the executor . The rule system was
+ called whenever an individual tuple had been accessed. This
+ implementation was removed in 1995 when the last official release
+ of the
Postgres project was transformed into
+
+
+
+
+ The second implementation of the rule system is a technique
+ called query rewriting .
+ The rewrite system } is a module
+ that exists between the parser stage and the
+ planner/optimizer . This technique is still implemented.
+
+
+
+
+
+ For information on the syntax and creation of rules in the
+
Postgres system refer to
+ The PostgreSQL User's Guide .
+
+
+
+
The Rewrite System
+
+ The query rewrite system is a module between
+ the parser stage and the planner/optimizer. It processes the tree handed
+ back by the parser stage (which represents a user query) and if
+ there is a rule present that has to be applied to the query it
+ rewrites the tree to an alternate form.
+
+
+
+
Techniques To Implement Views
+
+ Now will sketch the algorithm of the query rewrite system. For
+ better illustration we show how to implement views using rules
+ as an example.
+
+
+ Let the following rule be given:
+
+ create rule view_rule
+ as on select
+ to test_view
+ do instead
+ select s.sname, p.pname
+ from supplier s, sells se, part p
+ where s.sno = se.sno and
+ p.pno = se.pno;
+
+
+
+ The given rule will be fired whenever a select
+ against the relation test_view is detected. Instead of
+ selecting the tuples from test_view the select statement
+ given in the action part of the rule is executed.
+
+
+ Let the following user-query against test_view be given:
+
+ select sname
+ from test_view
+ where sname <> 'Smith';
+
+
+
+ Here is a list of the steps performed by the query rewrite
+ system whenever a user-query against test_view appears. (The
+ following listing is a very informal description of the algorithm just
+ intended for basic understanding. For a detailed description refer
+ to ).
+
+
+
test_view Rewrite
+
+ Take the query given in the action part of the rule.
+
+
+
+
+ Adapt the targetlist to meet the number and order of
+ attributes given in the user-query.
+
+
+
+
+ Add the qualification given in the where clause of the
+ user-query to the qualification of the query given in the
+ action part of the rule.
+
+
+
+
+ Given the rule definition above, the user-query will be
+ rewritten to the following form (Note that the rewriting is done on
+ the internal representation of the user-query handed back by the
+ parser stage but the derived new data structure will represent the following
+ query):
+
+ select s.sname
+ from supplier s, sells se, part p
+ where s.sno = se.sno and
+ p.pno = se.pno and
+ s.sname <> 'Smith';
+
+
+
+
+
+
+
Planner/Optimizer
+
+ The task of the planner/optimizer is to create an optimal
+ execution plan. It first combines all possible ways of
+ scanning and joining
+ the relations that appear in a
+ query. All the created paths lead to the same result and it's the
+ task of the optimizer to estimate the cost of executing each path and
+ find out which one is the cheapest.
+
+
+
+
Generating Possible Plans
+
+ The planner/optimizer decides which plans should be generated
+ based upon the types of indices defined on the relations appearing in
+ a query. There is always the possibility of performing a
+ sequential scan on a relation, so a plan using only
+ sequential scans is always created. Assume an index is defined on a
+ relation (for example a B-tree index) and a query contains the
+ restriction
+ relation.attribute OPR constant . If
+ relation.attribute happens to match the key of the B-tree
+ index and OPR is anything but '<>' another plan is created using
+ the B-tree index to scan the relation. If there are further indices
+ present and the restrictions in the query happen to match a key of an
+ index further plans will be considered.
+
+
+ After all feasible plans have been found for scanning single
+ relations, plans for joining relations are created. The
+ planner/optimizer considers only joins between every two relations for
+ which there exists a corresponding join clause (i.e. for which a
+ restriction like where rel1.attr1=rel2.attr2 exists) in the
+ where qualification. All possible plans are generated for every
+ join pair considered by the planner/optimizer. The three possible join
+ strategies are:
+
+
+
+ nested iteration join : The right relation is scanned
+ once for every tuple found in the left relation. This strategy
+ is easy to implement but can be very time consuming.
+
+
+
+
+ merge sort join : Each relation is sorted on the join
+ attributes before the join starts. Then the two relations are
+ merged together taking into account that both relations are
+ ordered on the join attributes. This kind of join is more
+ attractive because every relation has to be scanned only once.
+
+
+
+
+ hash join : the right relation is first hashed on its
+ join attributes. Next the left relation is scanned and the
+ appropriate values of every tuple found are used as hash keys to
+ locate the tuples in the right relation.
+
+
+
+
+
+
+
+
Data Structure of the Plan
+
+ Here we will give a little description of the nodes appearing in the
+ plan. Figure \ref{plan} shows the plan produced for the query in
+ example \ref{simple_select}.
+
+
+ The top node of the plan is a MergeJoin node which has two
+ successors, one attached to the field lefttree and the second
+ attached to the field righttree . Each of the subnodes represents
+ one relation of the join. As mentioned above a merge sort
+ join requires each relation to be sorted. That's why we find
+ a Sort node in each subplan. The additional qualification given
+ in the query (s.sno > 2 ) is pushed down as far as possible and is
+ attached to the qpqual field of the leaf SeqScan node of
+ the corresponding subplan.
+
+
+ The list attached to the field mergeclauses of the
+ MergeJoin node contains information about the join attributes.
+ The values 65000 and 65001
+ for the varno fields in the
+ VAR nodes appearing in the mergeclauses list (and also in the
+ targetlist ) mean that not the tuples of the current node should be
+ considered but the tuples of the next "deeper" nodes (i.e. the top
+ nodes of the subplans) should be used instead.
+
+
+ Note that every Sort and SeqScan node appearing in figure
+ \ref{plan} has got a targetlist but because there was not enough space
+ only the one for the MergeJoin node could be drawn.
+
+
+ Another task performed by the planner/optimizer is fixing the
+ operator ids in the Expr
+ and Oper nodes. As
+ mentioned earlier,
Postgres supports a variety of different data
+ types and even user defined types can be used. To be able to maintain
+ the huge amount of functions and operators it is necessary to store
+ them in a system table. Each function and operator gets a unique
+ operator id. According to the types of the attributes used
+ within the qualifications etc., the appropriate operator ids
+ have to be used.
+
+
+
+
+
+
Executor
+
+ The executor takes the plan handed back by the
+ planner/optimizer and starts processing the top node. In the case of
+ our example (the query given in example \ref{simple_select}) the top
+ node is a MergeJoin node.
+
+
+ Before any merge can be done two tuples have to be fetched (one from
+ each subplan). So the executor recursively calls itself to
+ process the subplans (it starts with the subplan attached to
+ lefttree ). The new top node (the top node of the left subplan) is a
+ SeqScan node and again a tuple has to be fetched before the node
+ itself can be processed. The executor calls itself recursively
+ another time for the subplan attached to lefttree of the
+ SeqScan node.
+
+
+ Now the new top node is a Sort node. As a sort has to be done on
+ the whole relation, the executor starts fetching tuples
+ from the Sort node's subplan and sorts them into a temporary
+ relation (in memory or a file) when the Sort node is visited for
+ the first time. (Further examinations of the Sort node will
+ always return just one tuple from the sorted temporary
+ relation.)
+
+
+ Every time the processing of the Sort node needs a new tuple the
+ executor is recursively called for the SeqScan node
+ attached as subplan. The relation (internally referenced by the
+ value given in the scanrelid field) is scanned for the next
+ tuple. If the tuple satisfies the qualification given by the tree
+ attached to qpqual it is handed back, otherwise the next tuple
+ is fetched until the qualification is satisfied. If the last tuple of
+ the relation has been processed a NULL pointer is
+ returned.
+
+
+ After a tuple has been handed back by the lefttree of the
+ MergeJoin the righttree is processed in the same way. If both
+ tuples are present the executor processes the MergeJoin
+ node. Whenever a new tuple from one of the subplans is needed a
+ recursive call to the executor is performed to obtain it. If a
+ joined tuple could be created it is handed back and one complete
+ processing of the plan tree has finished.
+
+
+ Now the described steps are performed once for every tuple, until a
+ NULL pointer is returned for the processing of the
+ MergeJoin node, indicating that we are finished.
+
+
+
+
+
+
+
+
+