-Received: from iconmail.bellatlantic.net (iconmail.bellatlantic.net [199.173.162.30])
-Received: from bellatlantic.net (client196-126-169.bellatlantic.net [151.196.126.169])
-> > > > that users were 'screaming for' may be a thing of the past.
-> > > insightdist and Vadim.
-> > desktop libraries and end-user tools depend on this "record set" strategy to
-> > operate effectively.
-> > for each OR grouping).
-> > The results are profound. Queries that used to scan tables because of the
-> > ORs, now make use of any indexes. Thus, the size of the table has virtually
-> > backend, now run in under a second.
-> > 1. If there is no usable index, performance is significantly worse. The
-> > patch does not check to make sure that there is a usable index. I could use
-> > some pointers on this.
-> > 2. Small tables are actually a bit slower than without the patch.
-> > 3. Not very elegant. I am looking for a more generalized solution.
-> > I have lots of ideas, but I would need to know the backend much better before
-> > attempting any of them. My favorite idea is before cnfify(), to factor the
-> > OR terms and pull out the constants into a virtual (temporary) table spaces.
-> > Then rewrite the query as a join. The optimizer will (should) treat the new
-> > query accordingly. This assumes that an efficient factoring algorithm exists
-> > and that temporary tables can exist in the heap.
-If the question is, is it still important: Very much yes.
-It is essential for many RAD tools using remote data objects which make use of key
-users are confronted with multi-part keys.
-with the backend to be mucking around in the optimizer. As I see it, cnf'ify is
-doing just what it is supposed to do. Boundless boolean logic.
-I think hope may lie, though, in identifying each AND'ed group associated with a key
-and tagging it as a special sub-root node which cnf'ify does not penetrate. This
-node would be allowed to pass to the later stages of the optimizer where it will be
-used to plan index scans. Easy for me to say.
-In the meantime, I still have the patch that I described in prior email. It has
-However, I do not feel that is a sufficiently functional approach that should be
-incorporated as a final solution. I will submit the patch if you, (anyone) does
-not come up with a better solution. It is coded to be activated by a SET KSQO to
-minimize its reach.
-Received: from iconmail.bellatlantic.net (iconmail.bellatlantic.net [199.173.162.30])
-Received: from bellatlantic.net (client196-126-73.bellatlantic.net [151.196.126.73])
-> OK, let me try this one.
-> restrictive/cheapest, and evaluate that one first.
-> to meet it, it can stop and not evaluate it using the other restriction.
-> qualification test on the remaining item if the first one tried is true.
-> but the memory explosion and query failure are serious problems.
-> it is actually hurting things.
-> (x=3 and y=6) or ...
-> clients is hurting reliability.
-> purpose in these cases.
-> making the OR a join to a temp table that holds all the constants.
-> WHERE (x=3 and y=6) ...
-> you are doing a sequential scan for each UNION.
-Practically speaking, the lack of an index concern, may not be justified. The reason
-these queries are being generated, with this shape, is because remote data objects on the
-client side are being told that a primary key exists on these tables. The object is told
-about these keys in one of two ways.
-1. It queries the database for the primary key of the table. The ODBC driver serviced
-this request by querying for the attributes used in {table_name}_pkey.
-2. The user manually specifies the primary key. In this case an actual index may not
-exist. (i.e. MS Access asks the user for this information if a primary key is not found
-The second case is the only one that would cause a problem. Fortunately, the solution is
-My only concern is to be able to accurately identify a query with the proper signature
-before rewriting it as a UNION. To what degree should this inspection be taken?
-BTW, I would not do the rewrite on OR's without AND's since you have fixed the OR's use
-of the index.
-There is one other potential issue. My experience with using arrays in tables and UNIONS
-creates problems. There are missing array comparison operators which are used by the
-implied DISTINCT.
-> them when the meet more than one qualification.
-> on tab, so it may be slow if x and y are indexed.
-> non-index join on lots of rows. Of course, now that we index OR's.
-> tests.
-> x=3, it is a column compared to a constant, while in x=y, it is a join.
-> That makes a huge difference.
-> because you also need the e=1 stuff.
-> OR represent only column = constant restrictions.
-> items took my time.
-> every other row looking for a match.
-> has a variety of paths to choose for evaluating the query.
-I did not try this earlier because I thought it was too good to be true. I was right.
-I tried commenting out the normalize() function in the cnfify(). The EXPLAIN showed a
-sequential scan and the resulting tuple set was empty. Time will not allow me to dig
-into this further this weekend.
-Unless you come up with a better solution, I am going to submit my patch on Monday to
-make the Sept. 1st deadline. It includes a SET switch to activate the rewrite so as not
-to cause problems outside the ODBC users. We can either improve it or yank it by the
-Oct. 1st deadline.
->> restrictive/cheapest, and evaluate that one first.
-database expert but I think we are talking about a NP-complete problem.
-missing the point.
-Received: from gandalf.telecom.at (gandalf.telecom.at [194.118.26.84]) by renoir.op.net (o1/$ Revision: 1.18 $) with ESMTP id GAA21099 for ; Mon, 31 Aug 1998 06:23:41 -0400 (EDT)
-Received: from zeugswettera.user.lan.at (zeugswettera.user.lan.at [10.4.123.227]) by gandalf.telecom.at (A.B.C.Delta4/8.8.8) with SMTP id MAA38132; Mon, 31 Aug 1998 12:22:07 +0200
-> WHERE (x=3 and y=6) ...
->you are doing a sequential scan for each UNION.
-unique index. So I think this special case should work well.
-using this syntax.
-Received: from localhost (majordom@localhost) by hub.org (8.8.8/8.7.5) with SMTP id BAA02180; Tue, 1 Sep 1998 01:48:43 -0400 (EDT)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 01 Sep 1998 01:47:48 +0000 (EDT)
-Received: (from majordom@localhost) by hub.org (8.8.8/8.7.5) id BAA02160 for pgsql-patches-outgoing; Tue, 1 Sep 1998 01:47:46 -0400 (EDT)
-Received: from bellatlantic.net (client196-126-3.bellatlantic.net [151.196.126.3])
-Subject: [PATCHES] Interim AND/OR memory exaustion fix.
-This is a multi-part message in MIME format.
-I will be cleaning this up more before the Oct 1 deadline.
-+ indexes when multiple ORs are specified in the where clause.
-+ See optimizer/prep/prepkeyset.c for more on this.
- CFLAGS += -I../..
- CFLAGS += -I../..
-+ * Special preparation for keyset queries.
-+ #define Node_Copy(from, newnode, field) newnode->field = copyObject(from->field)
+++ /dev/null
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id TAA19488
- for ; Sun, 22 Jun 1997 19:31:03 -0400 (EDT)
-Received: from faerie.CS.Berkeley.EDU (faerie.CS.Berkeley.EDU [128.32.37.53]) by renoir.op.net ($ Revision: 1.12 $) with SMTP id TAA18795 for ; Sun, 22 Jun 1997 19:18:06 -0400 (EDT)
-Received: from localhost.Berkeley.EDU (localhost.Berkeley.EDU [127.0.0.1]) by faerie.CS.Berkeley.EDU (8.6.10/8.6.3) with SMTP id QAA07816 for [email protected]; Sun, 22 Jun 1997 16:16:44 -0700
-X-Authentication-Warning: faerie.CS.Berkeley.EDU: Host localhost.Berkeley.EDU didn't use HELO protocol
-To: Bruce Momjian
-Subject: Re: PostgreSQL psort() function performance
-In-reply-to: Your message of Sun, 22 Jun 1997 09:45:31 -0400 (EDT)
-Date: Sun, 22 Jun 97 16:16:43 -0700
-X-Mts: smtp
-Status: OR
-
-the mariposa distribution (http://mariposa.cs.berkeley.edu/) contains
-some hacks to nodeSort.c and psort.c that
- - make psort read directly from the executor node below it
- (instead of an input relation)
- - makes the Sort node read directly from the last set of psort runs
- (instead of an output relation)
-speeds things up quite a bit. kind of ruins psort for other purposes,
-though (which is why nbtsort.c exists).
-
-i'd merge these in first and see how far that gets you.
---
- Paul M. Aoki | University of California at Berkeley
- | Berkeley, CA 94720-1776
-
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id JAA01676
- for ; Mon, 3 Nov 1997 09:31:02 -0500 (EST)
-Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id JAA07345 for ; Mon, 3 Nov 1997 09:13:20 -0500 (EST)
-Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id IAA13315; Mon, 3 Nov 1997 08:50:26 -0500 (EST)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 08:48:07 -0500 (EST)
-Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id IAA11722 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 08:48:02 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by hub.org (8.8.5/8.7.5) with ESMTP id IAA11539 for ; Mon, 3 Nov 1997 08:47:34 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id UAA19066; Mon, 3 Nov 1997 20:48:04 +0700 (KRS)
-Date: Mon, 03 Nov 1997 20:48:04 +0700
-From: "Vadim B. Mikheev"
-Organization: ITTS (Krasnoyarsk)
-X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
-MIME-Version: 1.0
-To: Marc Howard Zuckman
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Status: OR
-
-Marc Howard Zuckman wrote:
->
-> On Mon, 3 Nov 1997, Bruce Momjian wrote:
->
-> > With fsync off, I just did an insert of 1000 integers into a table
-> > containing a single int4 column and no indexes, and it completed in 2.3
-> > seconds. This is on the new source tree. That is 434 inserts/second.
-> > Pretty major performance, or 2.3 ms/insert. This is on an idle PP200
-> > with UltraSCSI drives.
-> >
-> > With fsync on, the time goes to 51 seconds. Wow, big difference.
->
-> If better alternative error recovery methods were available, perhaps
-> a facility to replay an interval transactions log from a prior dump,
-> it would be reasonable to run the backend without fsync and
-> take advantage of the performance gains.
-
-???
-
->
-> I don't know the answer, but I suspect that the commercial databases
-> don't "fsync" the way pgsql does.
-
-Could someone try 1000 int4 inserts using postgres and
-some commercial database (on the same machine) ?
-
-Vadim
-
-
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id JAA01183
- for ; Mon, 3 Nov 1997 09:01:00 -0500 (EST)
-Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id IAA06632 for ; Mon, 3 Nov 1997 08:51:58 -0500 (EST)
-Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id IAA05964; Mon, 3 Nov 1997 08:39:39 -0500 (EST)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 08:37:32 -0500 (EST)
-Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id IAA04729 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 08:37:26 -0500 (EST)
-Received: from fallon.classyad.com ([email protected] [152.160.43.1]) by hub.org (8.8.5/8.7.5) with ESMTP id IAA04614 for ; Mon, 3 Nov 1997 08:37:16 -0500 (EST)
-Received: from fallon.classyad.com ([email protected] [152.160.43.1]) by fallon.classyad.com (8.8.5/8.7.3) with SMTP id JAA22108; Mon, 3 Nov 1997 09:11:09 -0500
-Date: Mon, 3 Nov 1997 09:11:09 -0500 (EST)
-From: Marc Howard Zuckman
-To: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Precedence: bulk
-Status: OR
-
-On Mon, 3 Nov 1997, Bruce Momjian wrote:
-
-> >
-> > Removed...
-> >
-> > Also, ItemPointerData t_chain (6 bytes) removed from HeapTupleHeader.
-> > CommandId is uint32 now (up to the 2^32 - 1 commands per transaction).
-> > DOUBLEALIGN(Sizeof(HeapTupleHeader)) is 40 bytes now.
-> >
-> > 1000 inserts (into table with single int4 column, 1 insert per transaction)
-> > takes 70 - 80 sec now (12.5 - 14 transactions/sec).
-> > This is hardware/OS limitation:
-> >
-> > fd = open ("t", O_RDWR);
-> > for (i = 1; i <= 1000; i++)
-> > {
-> > lseek(fd, 0, SEEK_END);
-> > write(fd, buf, 56);
-> > fsync(fd);
-> > }
-> > close (fd);
-> >
-> > takes 33 - 39 sec and so it's not possible to be faster
-> > having 2 fsync-s per transaction.
-> >
-> > The same test on 6.2.1: 92 - 107 sec
->
-> With fsync off, I just did an insert of 1000 integers into a table
-> containing a single int4 column and no indexes, and it completed in 2.3
-> seconds. This is on the new source tree. That is 434 inserts/second.
-> Pretty major performance, or 2.3 ms/insert. This is on an idle PP200
-> with UltraSCSI drives.
->
-> With fsync on, the time goes to 51 seconds. Wow, big difference.
-
-If better alternative error recovery methods were available, perhaps
-a facility to replay an interval transactions log from a prior dump,
-it would be reasonable to run the backend without fsync and
-take advantage of the performance gains.
-
-I don't know the answer, but I suspect that the commercial databases
-don't "fsync" the way pgsql does.
-
-Marc Zuckman
-
-_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
-_ Visit The Home and Condo MarketPlace _
-_ http://www.ClassyAd.com _
-_ _
-_ FREE basic property listings/advertisements and searches. _
-_ _
-_ Try our premium, yet inexpensive services for a real _
-_ selling or buying edge! _
-_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
-
-
-
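The fsync loop quoted in the message above can be turned into a small
timing harness. The following is only a sketch, not code from the backend:
the helper name timed_appends, the 256-byte buffer limit, and the use of
O_CREAT | O_TRUNC (the quoted loop opens an existing file "t") are
assumptions made for illustration. It appends fixed-size records and
optionally calls fsync() after each one, so the same run can be compared
with and without syncing.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/*
 * Append `count` records of `len` bytes to `path`, calling fsync() after
 * each write when `do_fsync` is nonzero.  Returns elapsed wall-clock
 * seconds, or -1.0 on any I/O error.
 * Hypothetical helper for illustration; not part of PostgreSQL.
 */
static double
timed_appends(const char *path, int count, size_t len, int do_fsync)
{
    char            buf[256];
    struct timespec start, end;
    int             fd, i;

    assert(len <= sizeof(buf));
    memset(buf, 'x', len);

    /* Assumption: create/truncate the file so every run starts empty. */
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < count; i++)
    {
        /* Same sequence as the quoted loop: seek to end, write, sync. */
        if (lseek(fd, 0, SEEK_END) < 0 ||
            write(fd, buf, len) != (ssize_t) len ||
            (do_fsync && fsync(fd) != 0))
        {
            close(fd);
            return -1.0;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    close(fd);

    return (double) (end.tv_sec - start.tv_sec) +
           (double) (end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling timed_appends("t", 1000, 56, 1) and then timed_appends("t", 1000, 56, 0)
gives the two sides of the comparison discussed in the thread. The absolute
numbers depend entirely on the disk, file system, and OS, so they should not
be expected to match the figures quoted above.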
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id LAA04080
- for ; Mon, 3 Nov 1997 11:31:00 -0500 (EST)
-Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id LAA13680 for ; Mon, 3 Nov 1997 11:21:30 -0500 (EST)
-Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id LAA07566; Mon, 3 Nov 1997 11:04:52 -0500 (EST)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 11:02:59 -0500 (EST)
-Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id LAA07372 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 11:02:52 -0500 (EST)
-Received: from candle.pha.pa.us ([email protected] [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id LAA07196 for ; Mon, 3 Nov 1997 11:02:22 -0500 (EST)
-Received: (from maillist@localhost)
- by candle.pha.pa.us (8.8.5/8.8.5) id KAA02525;
- Mon, 3 Nov 1997 10:42:03 -0500 (EST)
-From: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Date: Mon, 3 Nov 1997 10:42:03 -0500 (EST)
-X-Mailer: ELM [version 2.4 PL25]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Status: OR
-
-> > I don't know the answer, but I suspect that the commercial databases
-> > don't "fsync" the way pgsql does.
->
-> Could someone try 1000 int4 inserts using postgres and
-> some commercial database (on the same machine) ?
-
-I have been thinking about this since seeing the performance change
-with/without fsync.
-
-Commercial databases usually do a log write every 5 or 15 minutes, and
-guarantee the logs will contain everything up to this time interval.
-
-Couldn't we have some such mechanism? Usually they have raw space, so
-they can control when the data is hitting the disk. Using a file
-system, some of it may be getting to the disk without our knowing it.
-
-What exactly is a scenario where lack of doing explicit fsync's will
-cause data corruption, rather than just lost data from the past few
-minutes?
-
-I think Vadim has gotten fsync's down to fsync'ing the modified data
-page, and pg_log.
-
-Let's suppose we did not fsync. There could be cases where pg_log was
-fsync'ed by the OS, and some of the modified data pages are fsync'ed by
-the OS, but not others. This would leave us with a partial transaction.
-
-However, let's suppose we prevent pg_log from being fsync'ed somehow.
-Then, because we have a no-overwrite database, we could keep control of
-this, and a write of some data pages, but not others, would not cause us
-problems because the pg_log would show all such transactions, which had
-not had all their modified data pages fsync'ed, as non-committed.
-
-Perhaps we can even set a flag in pg_log every five minutes to indicate
-whether all buffers for the page have been flushed? That way we would
-not have to worry about preventing flushing of pg_log.
-
-Comments?
-
---
-Bruce Momjian
-
-
-Received: from hub.org (hub.org [209.47.148.200])
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id MAA04456
- for ; Mon, 3 Nov 1997 12:00:40 -0500 (EST)
-Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id LAA26054; Mon, 3 Nov 1997 11:46:49 -0500 (EST)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 11:46:33 -0500 (EST)
-Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id LAA25932 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 11:46:30 -0500 (EST)
-Received: from orion.SAPserv.Hamburg.dsh.de (polaris.sapserv.debis.de [53.2.131.8]) by hub.org (8.8.5/8.7.5) with SMTP id LAA25750 for ; Mon, 3 Nov 1997 11:45:53 -0500 (EST)
-Received: by orion.SAPserv.Hamburg.dsh.de
- (Linux Smail3.1.29.1 #1)}
- id m0xSPfE-000BGZC; Mon, 3 Nov 97 17:47 MET
-Message-Id:
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Date: Mon, 3 Nov 1997 17:47:43 +0100 (MET)
-X-Mailer: ELM [version 2.4 PL25]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=iso-8859-1
-Content-Transfer-Encoding: 8bit
-Precedence: bulk
-Status: OR
-
->
-> > > I don't know the answer, but I suspect that the commercial databases
-> > > don't "fsync" the way pgsql does.
-> >
-> > Could someone try 1000 int4 inserts using postgres and
-> > some commercial database (on the same machine) ?
->
-> I have been thinking about this since seeing the performance change
-> with/without fsync.
->
-> Commercial databases usually do a log write every 5 or 15 minutes, and
-> guarantee the logs will contain everything up to this time interval.
->
-
- Without fsync PostgreSQL would only lose data if the OS
- crashes between the last write operation of a backend and the
- next regular update sync. This is seldom but if it happens it
- really hurts.
-
- A database can omit fsync on data files (e.g. tablespaces) if
- it writes a redo log. With that redo log, a backup can be
- restored and then all transactions since the backup redone.
-
- PostgreSQL doesn't write such a redo log. So an OS crash
- after the fsync of pg_log could corrupt the database without
- a chance to recover.
-
- Isn't it time to get an (optional) redo log? I don't exactly
- know all the places where our datafiles can get modified, but
- I hope this is only done in the heap access methods and
- vacuum. So these are the places from where the redo log data
- comes from (plus transaction commit/rollback).
-
-
-Until later, Jan
-
---
-#define OPINIONS "they are all mine - not those of debis or daimler-benz"
-
-#======================================================================#
-# It's easier to get forgiveness for being wrong than for being right. #
-# Let's break this rule - forgive me. #
-
-
-
-
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id OAA06775
- for ; Mon, 3 Nov 1997 14:01:04 -0500 (EST)
-Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id NAA22235 for ; Mon, 3 Nov 1997 13:43:15 -0500 (EST)
-Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id NAA11482; Mon, 3 Nov 1997 13:32:40 -0500 (EST)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 13:32:02 -0500 (EST)
-Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id NAA11204 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 13:31:58 -0500 (EST)
-Received: from candle.pha.pa.us ([email protected] [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id NAA11119 for ; Mon, 3 Nov 1997 13:31:44 -0500 (EST)
-Received: (from maillist@localhost)
- by candle.pha.pa.us (8.8.5/8.8.5) id MAA05464;
- Mon, 3 Nov 1997 12:59:01 -0500 (EST)
-From: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Date: Mon, 3 Nov 1997 12:59:01 -0500 (EST)
-X-Mailer: ELM [version 2.4 PL25]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Status: OR
-
->
-> >
-> > > > I don't know the answer, but I suspect that the commercial databases
-> > > > don't "fsync" the way pgsql does.
-> > >
-> > > Could someone try 1000 int4 inserts using postgres and
-> > > some commercial database (on the same machine) ?
-> >
-> > I have been thinking about this since seeing the performance change
-> > with/without fsync.
-> >
-> > Commercial databases usually do a log write every 5 or 15 minutes, and
-> > guarantee the logs will contain everything up to this time interval.
-> >
->
-> Without fsync PostgreSQL would only lose data if the OS
-> crashes between the last write operation of a backend and the
-> next regular update sync. This is seldom but if it happens it
-> really hurts.
->
-> A database can omit fsync on data files (e.g. tablespaces) if
-> it writes a redo log. With that redo log, a backup can be
-> restored and then all transactions since the backup redone.
->
-> PostgreSQL doesn't write such a redo log. So an OS crash
-> after the fsync of pg_log could corrupt the database without
-> a chance to recover.
->
-> Isn't it time to get an (optional) redo log? I don't exactly
-> know all the places where our datafiles can get modified, but
-> I hope this is only done in the heap access methods and
-> vacuum. So these are the places from where the redo log data
-> comes from (plus transaction commit/rollback).
->
-
-Yes, but because we are a non-over-write database, I don't see why we
-can't just do this without a redo log.
-
-Every five minutes, we fsync() all dirty pages, mark all completed
-transactions as fsync'ed in pg_log, and fsync() pg_log.
-
-On postmaster startup, any transaction marked as completed, but not
-marked as fsync'ed gets marked as aborted.
-
-Of course, all vacuum operations would have to be fsync'ed.
-
-Comments?
-
---
-Bruce Momjian
-
-
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id QAA10292
- for ; Mon, 3 Nov 1997 16:45:59 -0500 (EST)
-Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id QAA02040 for ; Mon, 3 Nov 1997 16:42:40 -0500 (EST)
-Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id QAA17422; Mon, 3 Nov 1997 16:34:28 -0500 (EST)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 03 Nov 1997 16:34:10 -0500 (EST)
-Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id QAA17210 for pgsql-hackers-outgoing; Mon, 3 Nov 1997 16:34:06 -0500 (EST)
-Received: from fallon.classyad.com ([email protected] [152.160.43.1]) by hub.org (8.8.5/8.7.5) with ESMTP id QAA16690 for ; Mon, 3 Nov 1997 16:33:27 -0500 (EST)
-Received: from fallon.classyad.com ([email protected] [152.160.43.1]) by fallon.classyad.com (8.8.5/8.7.3) with SMTP id RAA32498; Mon, 3 Nov 1997 17:33:42 -0500
-Date: Mon, 3 Nov 1997 17:33:42 -0500 (EST)
-From: Marc Howard Zuckman
-To: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Precedence: bulk
-Status: OR
-
-On Mon, 3 Nov 1997, Bruce Momjian wrote:
-
-> >
-> > >
-> > > > > I don't know the answer, but I suspect that the commercial databases
-> > > > > don't "fsync" the way pgsql does.
-> > > >
-> > > > Could someone try 1000 int4 inserts using postgres and
-> > > > some commercial database (on the same machine) ?
-> > >
-> > > I have been thinking about this since seeing the performance change
-> > > with/without fsync.
-> > >
-> > > Commercial databases usually do a log write every 5 or 15 minutes, and
-> > > guarantee the logs will contain everything up to this time interval.
-> > >
-> >
-> > Without fsync PostgreSQL would only lose data if the OS
-> > crashes between the last write operation of a backend and the
-> > next regular update sync. This is seldom but if it happens it
-> > really hurts.
-> >
-> > A database can omit fsync on data files (e.g. tablespaces) if
-> > it writes a redo log. With that redo log, a backup can be
-> > restored and then all transactions since the backup redone.
-> >
-> > PostgreSQL doesn't write such a redo log. So an OS crash
-> > after the fsync of pg_log could corrupt the database without
-> > a chance to recover.
-> >
-> > Isn't it time to get an (optional) redo log? I don't exactly
-> > know all the places where our datafiles can get modified, but
-> > I hope this is only done in the heap access methods and
-> > vacuum. So these are the places from where the redo log data
-> > comes from (plus transaction commit/rollback).
-> >
->
-> Yes, but because we are a non-over-write database, I don't see why we
-> can't just do this without a redo log.
-
-Because if the hard drive is the reason for the failure (instead of
-power out, OS bites dust, etc), the database won't be of much help.
-
-The redo log should be on a device different than the database.
-
-Marc Zuckman
-
-_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
-_ Visit The Home and Condo MarketPlace _
-_ http://www.ClassyAd.com _
-_ _
-_ FREE basic property listings/advertisements and searches. _
-_ _
-_ Try our premium, yet inexpensive services for a real _
-_ selling or buying edge! _
-_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
-
-
-
-From maillist Mon Nov 3 22:59:31 1997
-Received: (from maillist@localhost)
- by candle.pha.pa.us (8.8.5/8.8.5) id WAA16264;
- Mon, 3 Nov 1997 22:59:31 -0500 (EST)
-From: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Date: Mon, 3 Nov 1997 22:59:30 -0500 (EST)
-X-Mailer: ELM [version 2.4 PL25]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Status: OR
-
->
-> > > I don't know the answer, but I suspect that the commercial databases
-> > > don't "fsync" the way pgsql does.
-> >
-> > Could someone try 1000 int4 inserts using postgres and
-> > some commercial database (on the same machine) ?
->
-> I have been thinking about this since seeing the performance change
-> with/without fsync.
->
-> Commercial databases usually do a log write every 5 or 15 minutes, and
-> guarantee the logs will contain everything up to this time interval.
->
-> Couldn't we have some such mechanism? Usually they have raw space, so
-> they can control when the data is hitting the disk. Using a file
-> system, some of it may be getting to the disk without our knowing it.
->
-> What exactly is a scenario where lack of doing explicit fsync's will
-> cause data corruption, rather than just lost data from the past few
-> minutes?
->
-> I think Vadim has gotten fsync's down to fsync'ing the modified data
-> page, and pg_log.
->
-> Let's suppose we did not fsync. There could be cases where pg_log was
-> fsync'ed by the OS, and some of the modified data pages are fsync'ed by
-> the OS, but not others. This would leave us with a partial transaction.
->
-> However, let's suppose we prevent pg_log from being fsync'ed somehow.
-> Then, because we have a no-overwrite database, we could keep control of
-> this, and a write of some data pages, but not others, would not cause us
-> problems because the pg_log would show all such transactions, which had
-> not had all their modified data pages fsync'ed, as non-committed.
->
-> Perhaps we can even set a flag in pg_log every five minutes to indicate
-> whether all buffers for the page have been flushed? That way we would
-> not have to worry about preventing flushing of pg_log.
->
-> Comments?
-
-OK, here is a more formal description of what I am suggesting. It will
-give us commercial dbms reliability with no-fsync performance.
-Commercial dbms's usually only give restore up to 5 minutes before the
-crash, and this is what I am suggesting. If we can do this, we can
-remove the no-fsync option.
-
-First, let's suppose there exists a shared queue that is visible to all
-backends and the postmaster that allows transaction id's to be added to
-the queue. We also add a bit to the pg_log record called 'been_synced'
-that is initially false.
-
-OK, once a backend starts a transaction, it puts a transaction id in
-pg_log. Once the transaction is finished, it is marked as committed.
-At the same time, we now put the transaction id on the shared queue.
-
-Every five minutes, or as defined by the administrator, the postmaster
-does a sync() call. On my OS, any user can call sync, and I think
-this is typical. update/pagecleaner does this every 30 seconds anyway,
-so it is no big deal for the postmaster to call it every 5 minutes. The
-nice thing about this is that the OS does the syncing of all the dirty
-pages for us. (An alarm() call can set up this 5 minute timing.)
-
-The postmaster then locks the shared transaction id queue, makes a copy
-of the entries in the queue, clears the queue, and unlocks the queue.
-It does this so no one else modifies the queue while it is being
-cleared.
-
-The postmaster then goes through pg_log, and marks each transaction as
-'been_synced'.
-
-The postmaster also performs this on shutdown.
-
-On postmaster startup, all transactions are checked and any transaction
-that is marked as committed but not 'been_synced' is marked as not
-committed. In this way, we prevent non-synced or partially synced
-transactions from being used.
-
-Of course, vacuum would have to do normal fsyncs because it is removing
-the transaction log.
-
-We need the shared transaction id queue because there is no way to find
-the newly committed transactions since the last sync. A transaction
-can last for hours.
-
---
-Bruce Momjian
-
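The proposal above can be modeled in a few lines of C. This is a
single-process sketch under stated assumptions, not backend code: the
names SyncState, LogEntry, commit_xact, periodic_sync and
recover_after_crash are hypothetical, the fixed-size arrays stand in for
pg_log and the shared queue, and the actual sync() call and queue locking
are only marked by comments.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_XACTS 64            /* assumption: tiny fixed-size model */

typedef enum { XACT_IN_PROGRESS, XACT_COMMITTED, XACT_ABORTED } XactStatus;

typedef struct
{
    XactStatus status;
    int        been_synced;     /* the extra bit proposed for pg_log */
} LogEntry;

typedef struct
{
    LogEntry log[MAX_XACTS];    /* stand-in for pg_log, indexed by xid */
    int      queue[MAX_XACTS];  /* xids committed since the last sync */
    size_t   queue_len;
} SyncState;

/* Backend side: mark the transaction committed and put it on the queue. */
static void
commit_xact(SyncState *s, int xid)
{
    assert(xid >= 0 && xid < MAX_XACTS);
    assert(s->queue_len < MAX_XACTS);
    s->log[xid].status = XACT_COMMITTED;
    s->queue[s->queue_len++] = xid;
}

/*
 * Postmaster side, run every few minutes and at shutdown.  A real
 * implementation would call sync() first and hold a lock while copying
 * and clearing the queue.  Returns the number of transactions marked.
 */
static size_t
periodic_sync(SyncState *s)
{
    int    copy[MAX_XACTS];
    size_t n = s->queue_len;
    size_t i;

    for (i = 0; i < n; i++)     /* copy the entries ... */
        copy[i] = s->queue[i];
    s->queue_len = 0;           /* ... then clear the queue */

    for (i = 0; i < n; i++)
        s->log[copy[i]].been_synced = 1;
    return n;
}

/*
 * Postmaster startup: any transaction that is committed but not
 * been_synced is treated as not committed.  Returns how many were undone.
 */
static size_t
recover_after_crash(SyncState *s)
{
    size_t undone = 0;
    int    xid;

    for (xid = 0; xid < MAX_XACTS; xid++)
    {
        if (s->log[xid].status == XACT_COMMITTED && !s->log[xid].been_synced)
        {
            s->log[xid].status = XACT_ABORTED;
            undone++;
        }
    }
    s->queue_len = 0;           /* the in-memory queue does not survive */
    return undone;
}
```

In this model, two transactions committed before a periodic_sync() survive
recover_after_crash(), while one committed afterwards is rolled back, which
is the "restore up to the last sync" behavior the proposal describes.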
-Received: from hub.org (hub.org [209.47.148.200])
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id CAA17544
- for ; Tue, 4 Nov 1997 02:13:06 -0500 (EST)
-Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id CAA14126; Tue, 4 Nov 1997 02:07:55 -0500 (EST)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 02:04:59 -0500 (EST)
-Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id CAA12859 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 02:04:51 -0500 (EST)
-Received: from orion.SAPserv.Hamburg.dsh.de (polaris.sapserv.debis.de [53.2.131.8]) by hub.org (8.8.5/8.7.5) with SMTP id CAA12625 for ; Tue, 4 Nov 1997 02:04:12 -0500 (EST)
-Received: by orion.SAPserv.Hamburg.dsh.de
- (Linux Smail3.1.29.1 #1)}
- id m0xSd44-000BFQC; Tue, 4 Nov 97 08:06 MET
-Message-Id:
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Date: Tue, 4 Nov 1997 08:06:16 +0100 (MET)
-X-Mailer: ELM [version 2.4 PL25]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=iso-8859-1
-Content-Transfer-Encoding: 8bit
-Precedence: bulk
-Status: OR
-
-> OK, here is a more formal description of what I am suggesting. It will
-> give us commercial dbms reliability with no-fsync performance.
-> Commercial dbms's usually only give restore up to 5 minutes before the
-> crash, and this is what I am suggesting. If we can do this, we can
-> remove the no-fsync option.
-
- I'm not 100% sure but as far as I know Oracle, it can recover
- up to the last committed transaction using the online redo
- logs. And even if commercial dbms's aren't able to do that,
- it should be our target.
-
-> [description about transaction queue]
-
- This all depends on the fact that PostgreSQL is a no
- overwrite dbms. Otherwise the space of deleted tuples might
- get overwritten by later transactions and the information is
- finally lost.
-
- Another issue: All we have thought of up to now are crashes where
- the database files are still usable after restart. But take
- the simple case of a write error. A new bad block or track
- will get remapped (in some way) but the data in it is lost.
- So we end up with one or more totally corrupted database
- files. And I don't trust mirrored disks farther than I can
- throw them. A bug in the OS or a memory failure (many new
- PeeCee boards don't support parity, and even with parity a
- two-bit failure yields wrong data with a valid parity
- bit) can also corrupt the data.
-
- I still prefer redo logs. They should reside on a different
- disk and the possibility of losing the database files along
- with the redo log is very small.
-
-
-Until later, Jan
-
---
-#define OPINIONS "they are all mine - not those of debis or daimler-benz"
-
-#======================================================================#
-# It's easier to get forgiveness for being wrong than for being right. #
-# Let's break this rule - forgive me. #
-
-
-
-
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id EAA18487
- for ; Tue, 4 Nov 1997 04:12:48 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id EAA03152 for ; Tue, 4 Nov 1997 04:12:06 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id QAA20591; Tue, 4 Nov 1997 16:14:06 +0700 (KRS)
-Date: Tue, 04 Nov 1997 16:14:05 +0700
-From: "Vadim B. Mikheev"
-Organization: ITTS (Krasnoyarsk)
-X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
-MIME-Version: 1.0
-To: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Status: OR
-
-Bruce Momjian wrote:
->
-> OK, here is a more formal description of what I am suggesting. It will
-> give us commercial dbms reliability with no-fsync performance.
-> Commercial dbms's usually only give restore up to 5 minutes before the
- ^^^^^^^^^^^^^^^^^^^^^^^
-I'm sure that this is not true!
-If the on-line redo file is damaged then you have
-a single option: restore your last backup.
-In all other cases the database will be recovered up to the last
-committed transaction automatically!
-
-DBMSs using WAL have to fsync only the redo file on commit
-(and they do it!); non-overwriting systems have to
-fsync the data files and the transaction log.
-
-We could optimize fsyncs for a multi-user environment: do not
-fsync when we are sure that our changes were already flushed to
-disk by another backend.
-
-> crash, and this is what I am suggesting. If we can do this, we can
-> remove the no-fsync option.
->
-...
->
-> On postmaster startup, all transactions are checked and any transaction
-> that is marked as committed but not 'been_synced' is marked as not
-> committed. In this way, we prevent non-synced or partially synced
-> transactions from being used.
-
-And what should users (who were assured that their transactions are
-committed) do in this case?
-
-Vadim
-
-Received: from hub.org (hub.org [209.47.148.200])
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id GAA19743
- for ; Tue, 4 Nov 1997 06:42:57 -0500 (EST)
-Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id GAA10352; Tue, 4 Nov 1997 06:36:08 -0500 (EST)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 06:35:42 -0500 (EST)
-Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id GAA10158 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 06:35:37 -0500 (EST)
-Received: from candle.pha.pa.us ([email protected] [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id GAA10096 for ; Tue, 4 Nov 1997 06:35:27 -0500 (EST)
-Received: (from maillist@localhost)
- by candle.pha.pa.us (8.8.5/8.8.5) id GAA19665;
- Tue, 4 Nov 1997 06:35:10 -0500 (EST)
-From: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Date: Tue, 4 Nov 1997 06:35:10 -0500 (EST)
-X-Mailer: ELM [version 2.4 PL25]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Status: OR
-
->
-> > OK, here is a more formal description of what I am suggesting. It will
-> > give us commercial dbms reliability with no-fsync performance.
-> > Commercial dbms's usually only give restore up to 5 minutes before the
-> > crash, and this is what I am suggesting. If we can do this, we can
-> > remove the no-fsync option.
->
-> I'm not 100% sure but as far as I know Oracle, it can recover
-> up to the last committed transaction using the online redo
-> logs. And even if commercial dbms's aren't able to do that,
-> it should be our target.
->
-> > [description about transaction queue]
->
-> This all depends on the fact that PostgreSQL is a no
-> overwrite dbms. Otherwise the space of deleted tuples might
-> get overwritten by later transactions and the information is
-> finally lost.
->
-> Another issue: All we have thought of up to now are crashes where
-> the database files are still usable after restart. But take
-> the simple case of a write error. A new bad block or track
-> will get remapped (in some way) but the data in it is lost.
-> So we end up with one or more totally corrupted database
-> files. And I don't trust mirrored disks farther than I can
-> throw them. A bug in the OS or a memory failure (many new
-> PeeCee boards don't support parity, and even with parity a
-> two-bit failure yields wrong data with a valid parity
-> bit) can also corrupt the data.
->
-> I still prefer redo logs. They should reside on a different
-> disk and the possibility of losing the database files along
-> with the redo log is very small.
-
-I have been thinking about re-do logs, and I think it is a good idea.
-It would not be hard to have the queries spit out to a separate file
-configurable by the user.
-
---
-Bruce Momjian
-
-
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id HAA22051
- for ; Tue, 4 Nov 1997 07:30:59 -0500 (EST)
-Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id HAA07444 for ; Tue, 4 Nov 1997 07:25:14 -0500 (EST)
-Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id HAA08818; Tue, 4 Nov 1997 07:03:30 -0500 (EST)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 07:02:44 -0500 (EST)
-Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id HAA08418 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 07:02:29 -0500 (EST)
-Received: from candle.pha.pa.us ([email protected] [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id HAA08331 for ; Tue, 4 Nov 1997 07:02:07 -0500 (EST)
-Received: (from maillist@localhost)
- by candle.pha.pa.us (8.8.5/8.8.5) id GAA21484;
- Tue, 4 Nov 1997 06:50:24 -0500 (EST)
-From: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Date: Tue, 4 Nov 1997 06:50:24 -0500 (EST)
-X-Mailer: ELM [version 2.4 PL25]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Status: OR
-
->
-> Bruce Momjian wrote:
-> >
-> > OK, here is a more formal description of what I am suggesting. It will
-> > give us commercial dbms reliability with no-fsync performance.
-> > Commercial dbms's usually only give restore up to 5 minutes before the
-> ^^^^^^^^^^^^^^^^^^^^^^^
-> I'm sure that this is not true!
-
-You may be right. This five minute figure is when you restore from your
-previous backup, then restore from the log file.
-
-Can't we do something like sync every 5 seconds, rather than after every
-transaction? It just seems like such overkill.
-
-Actually, I found a problem with my description. Because pg_log is not
-fsync'ed, after a crash, pages with new transactions could have been
-flushed to disk, but not the pg_log table that contains the transaction
-ids. The problem is that the new backend could assign a transaction id
-that is already in use.
-
-We could set a flag upon successful shutdown, and if it is not set on
-reboot, either do a vacuum to find the max transaction id and
-invalidate all those not marked in pg_log as synced, or increase the next
-transaction id to some huge number and invalidate all those in between.
-
-
-> If the on-line redo file is damaged then you have
-> a single option: restore your last backup.
-> In all other cases the database will be recovered up to the last
-> committed transaction automatically!
->
-> DBMSs using WAL have to fsync only the redo file on commit
-> (and they do it!); non-overwriting systems have to
-> fsync the data files and the transaction log.
->
-> We could optimize fsyncs for a multi-user environment: do not
-> fsync when we are sure that our changes were already flushed to
-> disk by another backend.
->
-> > crash, and this is what I am suggesting. If we can do this, we can
-> > remove the no-fsync option.
-> >
-> ...
-> >
-> > On postmaster startup, all transactions are checked and any transaction
-> > that is marked as committed but not 'been_synced' is marked as not
-> > committed. In this way, we prevent non-synced or partially synced
-> > transactions from being used.
->
-> And what should users (who were assured that their transactions are
-> committed) do in this case?
->
-> Vadim
->
->
-
-
---
-Bruce Momjian
-
-
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id HAA21697
- for ; Tue, 4 Nov 1997 07:00:58 -0500 (EST)
-Received: from orion.SAPserv.Hamburg.dsh.de (polaris.sapserv.debis.de [53.2.131.8]) by renoir.op.net (o1/$ Revision: 1.14 $) with SMTP id GAA06401 for ; Tue, 4 Nov 1997 06:48:25 -0500 (EST)
-Received: by orion.SAPserv.Hamburg.dsh.de
- (Linux Smail3.1.29.1 #1)}
- id m0xShVQ-000BGZC; Tue, 4 Nov 97 12:50 MET
-Message-Id:
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Date: Tue, 4 Nov 1997 12:50:45 +0100 (MET)
-X-Mailer: ELM [version 2.4 PL25]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=iso-8859-1
-Content-Transfer-Encoding: 8bit
-Status: OR
-
-
-Bruce Momjian wrote:
-> I have been thinking about re-do logs, and I think it is a good idea.
-> It would not be hard to have the queries spit out to a separate file
-> configurable by the user.
-
- This way the recovery process will be very complicated. When
- multiple backends run concurrently, there are multiple
- transactions active at the same time. And which tuples are
- affected by, e.g., an update depends very much on the timing.
-
- I had something different in mind. The redo log contains the
- information from the executor (e.g. the transactionId, the
- tupleId and the new tuple values when calling ExecReplace())
- and the information about which transactions commit and which
- do not. When recovering, the operations of committed
- transactions are passed again to the executor's functions that
- do the real updates, using the values from the logfile.
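-
A minimal sketch of the recovery idea described above, assuming a log of executor-level records plus commit markers. The record types `Replace` and `Commit`, the `replay` function, and the dict standing in for a table are all hypothetical names chosen for illustration; nothing here is backend code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replace:
    """Executor-level change: transaction `xid` set tuple `tuple_id` to `new_values`."""
    xid: int
    tuple_id: int
    new_values: tuple


@dataclass(frozen=True)
class Commit:
    """Marks transaction `xid` as committed."""
    xid: int


def replay(log, table):
    """Re-apply, in log order, only the changes made by committed transactions."""
    committed = {rec.xid for rec in log if isinstance(rec, Commit)}
    for rec in log:
        if isinstance(rec, Replace) and rec.xid in committed:
            table[rec.tuple_id] = rec.new_values
    return table


# Table contents as restored from the last backup (hypothetical data).
backup = {10: ("old",), 11: ("keep",)}
log = [
    Replace(xid=1, tuple_id=10, new_values=("new",)),
    Replace(xid=2, tuple_id=11, new_values=("lost",)),  # xid 2 never commits
    Commit(xid=1),
]
recovered = replay(log, dict(backup))
```

Because the log stores tuple ids and values rather than query text, replay does not depend on how concurrent transactions happened to interleave at run time.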
-
-
-Until later, Jan
-
---
-#define OPINIONS "they are all mine - not those of debis or daimler-benz"
-
-#======================================================================#
-# It's easier to get forgiveness for being wrong than for being right. #
-# Let's break this rule - forgive me. #
-
-
-
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id HAA22048
- for ; Tue, 4 Nov 1997 07:30:57 -0500 (EST)
-Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id HAA07189 for ; Tue, 4 Nov 1997 07:18:02 -0500 (EST)
-Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id HAA08856; Tue, 4 Nov 1997 07:03:37 -0500 (EST)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 07:03:03 -0500 (EST)
-Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id HAA08487 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 07:02:46 -0500 (EST)
-Received: from candle.pha.pa.us ([email protected] [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id HAA08192 for ; Tue, 4 Nov 1997 07:02:02 -0500 (EST)
-Received: (from maillist@localhost)
- by candle.pha.pa.us (8.8.5/8.8.5) id HAA21653;
- Tue, 4 Nov 1997 07:00:20 -0500 (EST)
-From: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Date: Tue, 4 Nov 1997 07:00:19 -0500 (EST)
-X-Mailer: ELM [version 2.4 PL25]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Status: OR
-
->
-> Bruce Momjian wrote:
-> >
-> > OK, here is a more formal description of what I am suggesting. It will
-> > give us commercial dbms reliability with no-fsync performance.
-> > Commercial dbms's usually only give restore up to 5 minutes before the
-> ^^^^^^^^^^^^^^^^^^^^^^^
-> I'm sure that this is not true!
-> If the on-line redo file is damaged then you have
-> a single option: restore your last backup.
-> In all other cases the database will be recovered up to the last
-> committed transaction automatically!
-
-I doubt commercial dbms's sync to disk after every transaction. They
-pick a time, maybe five seconds, and see that all dirty pages get flushed
-by then.
-
-What they do is make certain that you are restored to a consistent
-state, perhaps 15 seconds ago.
-
---
-Bruce Momjian
-
-
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86])
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id HAA22066
- for ; Tue, 4 Nov 1997 07:32:35 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id TAA20889; Tue, 4 Nov 1997 19:35:12 +0700 (KRS)
-Date: Tue, 04 Nov 1997 19:35:12 +0700
-From: "Vadim B. Mikheev"
-Organization: ITTS (Krasnoyarsk)
-X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
-MIME-Version: 1.0
-To: Jan Wieck
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-References:
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Status: OR
-
->
-> I still prefer redo logs. They should reside on a different
-> disk and the possibility of losing the database files along
-> with the redo log is very small.
-
-Agreed. This way we could skip the fsync of the data files and
-fsync only the redo log and pg_log. This is much faster.
-
-Vadim
-
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id IAA22371
- for ; Tue, 4 Nov 1997 08:00:56 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id HAA08540 for ; Tue, 4 Nov 1997 07:57:25 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id TAA20935; Tue, 4 Nov 1997 19:59:46 +0700 (KRS)
-Date: Tue, 04 Nov 1997 19:59:46 +0700
-From: "Vadim B. Mikheev"
-Organization: ITTS (Krasnoyarsk)
-X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
-MIME-Version: 1.0
-To: Jan Wieck
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-References:
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Status: OR
-
->
-> Bruce Momjian wrote:
-> > I have been thinking about re-do logs, and I think it is a good idea.
-> > It would not be hard to have the queries spit out to a separate file
-> > configurable by the user.
->
-> This way the recovery process will be very complicated. When
-> multiple backends run concurrently, there are multiple
-> transactions active at the same time. And which tuples are
-> affected by, e.g., an update depends very much on the timing.
->
-> I had something different in mind. The redo log contains the
-> information from the executor (e.g. the transactionId, the
-> tupleId and the new tuple values when calling ExecReplace())
-> and the information about which transactions commit and which
-> do not. When recovering, the operations of committed
-> transactions are passed again to the executor's functions that
-> do the real updates, using the values from the logfile.
-
-It seems that this is what Oracle does, but Sybase writes queries
-(with transaction ids, of course, and before execution) and
-begin, commit/abort events <-- this is better for a non-overwriting
-system (shorter redo file), but, agreed, recovery is more complicated.
-
-Vadim
-
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id WAA05060
- for ; Tue, 4 Nov 1997 22:35:43 -0500 (EST)
-Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id WAA26725 for ; Tue, 4 Nov 1997 22:35:10 -0500 (EST)
-Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id WAA27875; Tue, 4 Nov 1997 22:23:14 -0500 (EST)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 04 Nov 1997 22:20:55 -0500 (EST)
-Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id WAA24162 for pgsql-hackers-outgoing; Tue, 4 Nov 1997 22:20:50 -0500 (EST)
-Received: from candle.pha.pa.us ([email protected] [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id WAA22727 for ; Tue, 4 Nov 1997 22:20:18 -0500 (EST)
-Received: (from maillist@localhost)
- by candle.pha.pa.us (8.8.5/8.8.5) id WAA04674;
- Tue, 4 Nov 1997 22:17:52 -0500 (EST)
-From: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Date: Tue, 4 Nov 1997 22:17:52 -0500 (EST)
-X-Mailer: ELM [version 2.4 PL25]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Status: OR
-
->
-> Bruce Momjian wrote:
-> >
-> > >
-> > > Bruce Momjian wrote:
-> > > >
-> > > > OK, here is a more formal description of what I am suggesting. It will
-> > > > give us commercial dbms reliability with no-fsync performance.
-> > > > Commercial dbms's usually only give restore up to 5 minutes before the
-> > > ^^^^^^^^^^^^^^^^^^^^^^^
-> > > I'm sure that this is not true!
-> >
-> > You may be right. This five minute figure is when you restore from your
-> > previous backup, then restore from the log file.
-> >
-> > Can't we do something like sync every 5 seconds, rather than after every
-> > transaction? It just seems like such overkill.
->
-> Isn't -F and sync in crontab the same ?
-
-OK, let me again try to marshal some (any?) support for my suggestion.
-
-Informix version 5/7 has three levels of logging: unbuffered
-logging (our normal fsync mode), buffered logging, and no logging (our
-no-fsync mode).
-
-We don't have buffered logging. Buffered logging guarantees you get put
-back to a consistent state after an os/server crash, usually to within
-30/90 seconds. You do not have any partial transactions lying around,
-but you do have some transactions that you thought were done, but are
-not.
-
-This is faster than non-buffered logging, but not as fast as no logging.
-Guess what mode everyone uses? The one we don't have, buffered logging!
-
-Unbuffered logging performance is terrible. No-logging mode is
-used to load huge chunks of data during off-hours.
-
-The problem we have is that we fsync every transaction, which causes a
-9-times slowdown in performance on single-integer inserts.
-
-That is a pretty heavy cost. But the alternative we give people is
-no-fsync mode, where we don't sync anything. In a crash, you could
-come back with partially committed data in your database if pg_log was
-sync'ed by the database but only some of the data pages were sync'ed.
-So if any data was changing within 30 seconds of the crash, you have to
-restore your previous backup.
-
-We really need a middle solution, that gives better data integrity, for
-a smaller price.
-
->
-> >
-> > Actually, I found a problem with my description. Because pg_log is not
-> > fsync'ed, after a crash, pages with new transactions could have been
-> > flushed to disk, but not the pg_log table that contains the transaction
-> > ids. The problem is that the new backend could assign a transaction id
-> > that is already in use.
->
-> Impossible. Backend flushes pg_variable after fetching next 32 xids.
-
-My suggestion is that we don't need to flush pg_variable or pg_log that
-much. My suggestion would speed up the test you do with 100 inserts
-inside a single transaction vs. 100 separate inserts.
-
-> >
-> > We could set a flag upon successful shutdown, and if it is not set on
-> > reboot, either do a vacuum to find the max transaction id and
-> > invalidate all those not marked in pg_log as synced, or increase the next
-> > transaction id to some huge number and invalidate all those in between.
-> >
-
-I have a fix for the problem stated above, and it doesn't require a
-vacuum.
-
-We decide to fsync pg_variable and pg_log every 10,000 transactions or
-oids. Then if the database is brought up, and it was not brought down
-cleanly, you increment oid and transaction_id by 10,000, because you
-know you couldn't have gotten more than that. All intermediate
-transactions that are not marked committed/synced are marked aborted.
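-
A rough sketch of the recovery rule just described, assuming the counter is persisted once every 10,000 ids and that a shutdown flag is available at startup. `XidAllocator`, `recover`, and `SYNC_INTERVAL` are hypothetical names for illustration; the real pg_variable and pg_log handling is not modelled, and a clean shutdown is assumed to persist the counter.

```python
SYNC_INTERVAL = 10_000  # ids handed out between syncs (the figure from the message)


class XidAllocator:
    """Hands out transaction ids, persisting the counter only occasionally."""

    def __init__(self, synced_next=0):
        self.next_xid = synced_next     # in memory only, lost on a crash
        self.synced_next = synced_next  # stands in for the fsync'ed counter

    def allocate(self):
        xid = self.next_xid
        self.next_xid += 1
        if self.next_xid - self.synced_next >= SYNC_INTERVAL:
            self.synced_next = self.next_xid  # the periodic fsync would go here
        return xid


def recover(synced_next, clean_shutdown, committed_and_synced):
    """Return (next id to hand out, ids to mark aborted) at startup."""
    if clean_shutdown:
        # Assumption: a clean shutdown persisted the counter, nothing to skip.
        return synced_next, set()
    # Fewer than SYNC_INTERVAL ids can have been handed out since the last
    # sync, so jumping ahead by SYNC_INTERVAL cannot reuse an id.
    new_next = synced_next + SYNC_INTERVAL
    aborted = set(range(synced_next, new_next)) - set(committed_and_synced)
    return new_next, aborted


alloc = XidAllocator()
handed_out = [alloc.allocate() for _ in range(25_000)]
# Simulated crash: only alloc.synced_next survives.
next_xid, aborted = recover(alloc.synced_next, False, committed_and_synced={20_000})
```

The cost of the scheme is a gap of unused ids after every unclean restart, which is why the skip distance has to match the sync interval exactly.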
-
----------------------------------------------------------------------------
-
-The problem we have with the current system is that we sync by action,
-not by time interval. If you are doing tons of inserts or updates, it
-is syncing after every one. What people really want is something that
-will sync not after every action, but after every minute or five
-minutes, so when the system is busy, the syncing every minute is just a
-small amount, and when the system is idle, no one cares if it syncs, and
-no one has to wait for the sync to complete.
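-
To make the "sync by time interval" idea concrete, here is a small sketch, assuming some periodic caller drives a `tick` method with the current time. `IntervalSync`, `commit`, `tick`, and the `sync_fn` callback are hypothetical names; the callback stands in for whatever actually flushes to disk.

```python
class IntervalSync:
    """Counts commits and syncs at most once per interval, not once per commit."""

    def __init__(self, interval_s, sync_fn, start=0.0):
        self.interval_s = interval_s
        self.sync_fn = sync_fn      # stand-in for the real flush to disk
        self.last_sync = start
        self.pending = 0            # commits acknowledged but not yet synced

    def commit(self):
        """Record a commit without waiting for any disk sync."""
        self.pending += 1

    def tick(self, now):
        """Sync if the interval has elapsed and there is something to flush."""
        if self.pending and now - self.last_sync >= self.interval_s:
            self.sync_fn()
            flushed = self.pending
            self.pending = 0
            self.last_sync = now
            return flushed
        return 0


sync_calls = []
syncer = IntervalSync(60.0, lambda: sync_calls.append("sync"))
for _ in range(1000):
    syncer.commit()
early = syncer.tick(30.0)    # interval not yet elapsed
flushed = syncer.tick(60.0)  # one sync covers all 1000 commits
idle = syncer.tick(120.0)    # nothing pending, so no sync
```

The commits counted in `pending` are exactly the ones that would be lost in a crash, which is the trade-off this mode accepts.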
-
-
---
-Bruce Momjian
-
-
-Received: from smtp.algonet.se (tomei.algonet.se [194.213.74.114])
- by candle.pha.pa.us (8.8.5/8.8.5) with SMTP id LAA02099
- for ; Wed, 5 Nov 1997 11:02:28 -0500 (EST)
-Received: (qmail 6685 invoked from network); 5 Nov 1997 17:01:06 +0100
- by tomei.algonet.se with SMTP; 5 Nov 1997 17:01:06 +0100
-Sender: root
-Date: Wed, 05 Nov 1997 17:02:16 +0100
-From: Mattias Kregert
-Organization: Algonet ISP
-X-Mailer: Mozilla 3.0Gold (X11; I; Linux 2.0.29 i586)
-MIME-Version: 1.0
-To: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Status: OR
-
-Bruce Momjian wrote:
->
-> We don't have buffered logging. Buffered logging guarantees you get put
-> back to a consistent state after an os/server crash, usually to within
-> 30/90 seconds. You do not have any partial transactions lying around,
-> but you do have some transactions that you thought were done, but are
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-> not.
- ^^^^
->
-> This is faster than non-buffered logging, but not as fast as no logging.
-> Guess what mode everyone uses? The one we don't have, buffered logging!
-
-Ouch! I would *not* like to use "buffered logging".
-What's the point in having the wrong data in the database and not
-knowing what updates, inserts or deletes to do to get the correct data?
-
-That's irrecoverable loss of data. Not what *I* want. Do *you* want it?
-
-
-> We really need a middle solution, that gives better data integrity, for
-> a smaller price.
-
-What I would like to have is this:
-
-If a backend tells the frontend that a transaction has completed,
-then that transaction should absolutely not get lost in case of a crash.
-
-What is needed is a log of changes since the last backup. This
-log would preferably reside on a remote machine or at least
-another disk. Then, if the power goes in the middle of a disk write,
-the disk explodes and the computer goes up in flames, you can
-install PostgreSQL on a new machine, restore the last backup and
-re-run the change log.
-
-
-> The problem we have with the current system is that we sync by action,
-> not by time interval. If you are doing tons of inserts or updates, it
-> is syncing after every one. What people really want is something that
-> will sync not after every action, but after every minute or five
-> minutes, so when the system is busy, the syncing every minute is just a
-> small amount, and when the system is idle, no one cares if it syncs, and
-> no one has to wait for the sync to complete.
-
-Yes, but this would only be the first step on the way to better
-crash-recovery.
-
-/* m */
-
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id MAA05156
- for ; Wed, 5 Nov 1997 12:20:13 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id LAA24123 for ; Wed, 5 Nov 1997 11:44:49 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id XAA23062; Wed, 5 Nov 1997 23:48:52 +0700 (KRS)
-Date: Wed, 05 Nov 1997 23:48:52 +0700
-From: "Vadim B. Mikheev"
-Organization: ITTS (Krasnoyarsk)
-X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
-MIME-Version: 1.0
-To: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Status: OR
-
-Bruce Momjian wrote:
->
-> OK, let me again try to marshal some (any?) support for my suggestion.
->
-> Informix version 5/7 has three levels of logging: unbuffered
-> logging (our normal fsync mode), buffered logging, and no logging (our
-> no-fsync mode).
->
-> We don't have buffered logging. Buffered logging guarantees you get put
-> back to a consistent state after an os/server crash, usually to within
-> 30/90 seconds. You do not have any partial transactions lying around,
-> but you do have some transactions that you thought were done, but are
-> not.
->
-> This is faster than non-buffered logging, but not as fast as no logging.
-> Guess what mode everyone uses? The one we don't have, buffered logging!
->
-> Unbuffered logging performance is terrible. No-logging mode is
-> used to load huge chunks of data during off-hours.
->
-> The problem we have is that we fsync every transaction, which causes a
-> 9-times slowdown in performance on single-integer inserts.
->
-> That is a pretty heavy cost. But the alternative we give people is
-> no-fsync mode, where we don't sync anything. In a crash, you could
-> come back with partially committed data in your database if pg_log was
-> sync'ed by the database but only some of the data pages were sync'ed.
-> So if any data was changing within 30 seconds of the crash, you have to
-> restore your previous backup.
->
-> We really need a middle solution, that gives better data integrity, for
-> a smaller price.
-
-There is no fsync synchronization currently.
-How could we be sure that all modified data pages are flushed
-when we decide to flush pg_log?
-If a backend doesn't fsync data pages & pg_log at commit time,
-then when must it flush them (data first)?
-
-This is what Oracle does:
-
-it uses a dedicated DBWR process for writing/flushing modified
-data pages and an LGWR process for writing/flushing the redo log
-(the redo log is also the transaction log). LGWR always flushes log pages
-when committing, but dirty data pages can be flushed _after_ transaction
-commit, when DBWR decides that it's time to do it (a la checkpoint interval).
-
-Using a redo log we could implement buffered logging quite easily.
-We could even do without dedicated processes (but flush redo before pg_log),
-though having LGWR could simplify things.
-
-Without a redo log or without some fsync synchronization we can't implement
-buffered logging. BTW, a shared system cache could help with
-fsync synchronization, but, imho, redo is better (and faster for
-un-buffered logging too).
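-
A toy model of the split described above, assuming the redo log is forced to disk at commit while data pages are written later. `MiniWal` and its methods are hypothetical; plain lists and dicts stand in for files, and undo handling is left out by letting the checkpoint write only pages whose last writer has committed.

```python
class MiniWal:
    """Log is forced at commit (LGWR role); data pages are written lazily (DBWR role)."""

    def __init__(self):
        self.log_buffer = []   # redo records not yet on disk
        self.log_disk = []     # durable redo log
        self.cache = {}        # page -> (xid, value), possibly dirty
        self.data_disk = {}    # durable data pages
        self.committed = set()

    def write(self, xid, page, value):
        self.log_buffer.append(("write", xid, page, value))
        self.cache[page] = (xid, value)

    def commit(self, xid):
        self.log_buffer.append(("commit", xid))
        self.log_disk.extend(self.log_buffer)  # only the log is forced here
        self.log_buffer.clear()
        self.committed.add(xid)

    def checkpoint(self):
        """Write dirty pages some time after commit (committed writers only)."""
        for page, (xid, value) in self.cache.items():
            if xid in self.committed:
                self.data_disk[page] = value

    def crash_and_recover(self):
        """Lose everything in memory, then redo committed writes from the log."""
        self.log_buffer.clear()
        self.cache.clear()
        committed = {rec[1] for rec in self.log_disk if rec[0] == "commit"}
        for rec in self.log_disk:
            if rec[0] == "write" and rec[1] in committed:
                _, _, page, value = rec
                self.data_disk[page] = value
        self.committed = committed


db = MiniWal()
db.write(1, "p", "a")
db.commit(1)
db.write(2, "q", "b")        # transaction 2 never commits
before = dict(db.data_disk)  # no data page has been written yet
db.crash_and_recover()
after = dict(db.data_disk)
```

The committed change survives the crash even though its data page was never flushed, because the commit forced the log record that describes it.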
-
-> > > Actually, I found a problem with my description. Because pg_log is not
-> > > fsync'ed, after a crash, pages with new transactions could have been
-> > > flushed to disk, but not the pg_log table that contains the transaction
-> > > ids. The problem is that the new backend could assign a transaction id
-> > > that is already in use.
-> >
-> > Impossible. Backend flushes pg_variable after fetching next 32 xids.
->
-> My suggestion is that we don't need to flush pg_variable or pg_log that
-> much. My suggestion would speed up the test you do with 100 inserts
-> inside a single transaction vs. 100 separate inserts.
->
-> > >
-> > > We could set a flag upon successful shutdown, and if it is not set on
-> > > reboot, either do a vacuum to find the max transaction id and
-> > > invalidate all those not marked in pg_log as synced, or increase the next
-> > > transaction id to some huge number and invalidate all those in between.
-> > >
->
-> I have a fix for the problem stated above, and it doesn't require a
-> vacuum.
->
-> We decide to fsync pg_variable and pg_log every 10,000 transactions or
-> oids. Then if the database is brought up, and it was not brought down
-> cleanly, you increment oid and transaction_id by 10,000, because you
-> know you couldn't have gotten more than that. All intermediate
-> transactions that are not marked committed/synced are marked aborted.
-
-This is what I propose to do by placing the next available oid/xid
-in shmem: this allows pre-fetching many more than 32 ids at once
-without losing them when a session closes.
-
-> The problem we have with the current system is that we sync by action,
-> not by time interval. If you are doing tons of inserts or updates, it
-> is syncing after every one. What people really want is something that
-> will sync not after every action, but after every minute or five
-> minutes, so when the system is busy, the syncing every minute is just a
-> small amount, and when the system is idle, no one cares if it syncs, and
-> no one has to wait for the sync to complete.
-
-When I'm really doing tons of inserts/updates/deletes I use
-BEGIN/END. But it doesn't work for a multi-user environment, of course.
-As for what people really want, I remember that recently someone
-said on the user list that if one wants to have 10-20 inserts/sec then he
-should use mysql, but I got 25 inserts/sec on AIC-7880 & WD Enterprise
-when using one session, 32 inserts/sec with two sessions inserting
-into two different tables and only 20 inserts/sec with two sessions
-inserting into the same table. Imho, this difference between 20 and 32
-is the more important thing to fix, and these results are not so bad
-in comparison with others.
-
-(BTW, we shouldn't forget about using raw devices to speed up things).
-
-Vadim
-
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id MAA05150
- for ; Wed, 5 Nov 1997 12:20:07 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id LAA24889 for ; Wed, 5 Nov 1997 11:59:27 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id AAA23096; Thu, 6 Nov 1997 00:03:19 +0700 (KRS)
-Date: Thu, 06 Nov 1997 00:03:19 +0700
-From: "Vadim B. Mikheev"
-Organization: ITTS (Krasnoyarsk)
-X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
-MIME-Version: 1.0
-To: Mattias Kregert
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Status: OR
-
-Mattias Kregert wrote:
->
-> Bruce Momjian wrote:
-> >
-> > We don't have buffered logging. Buffered logging guarantees you get put
-> > back to a consistent state after an os/server crash, usually to within
-> > 30/90 seconds. You do not have any partial transactions lying around,
-> > but you do have some transactions that you thought were done, but are
-> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-> > not.
-> ^^^^
-> >
-> > This is faster than non-buffered logging, but not as fast as no logging.
-> > Guess what mode everyone uses? The one we don't have, buffered logging!
->
-> Ouch! I would *not* like to use "buffered logging".
-
-Neither would I.
-
-> What's the point in having the wrong data in the database and not
-> knowing what updates, inserts or deletes to do to get the correct data?
->
-> That's irrecoverable loss of data. Not what *I* want. Do *you* want it?
->
-> > We really need a middle solution, that gives better data integrity, for
-> > a smaller price.
->
-> What I would like to have is this:
->
-> If a backend tells the frontend that a transaction has completed,
-> then that transaction should absolutely not get lost in case of a crash.
-
-Agreed.
-
->
-> What is needed is a log of changes since the last backup. This
-> log would preferably reside on a remote machine or at least
-> another disk. Then, if the power goes in the middle of a disk write,
-> the disk explodes and the computer goes up in flames, you can
-> install Postgresql on a new machine, restore the last backup and
-> re-run the change log.
-
-Yes. And as I already said, this will speed things up, because
-flushing the redo log is faster than flushing NNN tables, which can
-then stay unflushed for some interval.
-
-Vadim
-
-
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id OAA07017
- for ; Wed, 5 Nov 1997 14:00:59 -0500 (EST)
-Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id NAA01759 for ; Wed, 5 Nov 1997 13:52:36 -0500 (EST)
-Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id NAA03611; Wed, 5 Nov 1997 13:29:43 -0500 (EST)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 05 Nov 1997 13:27:48 -0500 (EST)
-Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id NAA03291 for pgsql-hackers-outgoing; Wed, 5 Nov 1997 13:27:41 -0500 (EST)
-Received: from candle.pha.pa.us ([email protected] [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id NAA02823 for ; Wed, 5 Nov 1997 13:26:20 -0500 (EST)
-Received: (from maillist@localhost)
- by candle.pha.pa.us (8.8.5/8.8.5) id NAA05863;
- Wed, 5 Nov 1997 13:16:09 -0500 (EST)
-From: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Date: Wed, 5 Nov 1997 13:16:09 -0500 (EST)
-X-Mailer: ELM [version 2.4 PL25]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Status: OR
-
-> There is no fsync synchronization currently.
-> How could we be ensured that all modified data pages are flushed
-> when we decided to flush pg_log ?
-> If backend doesn't fsync data pages & pg_log at the commit time
-> then when he must flush them (data first) ?
-
-My idea was to have the backend do a 'sync' that causes the OS to sync
-all dirty pages, then mark all committed transactions on pg_log as
-'synced'. Then sync pg_log. That way, there is a clear system where we
-know everything is flushed to disk, and we mark the transactions as
-synced.
-
-The only time that synced flag is used, is when the database starts up,
-and it sees that the previous shutdown was not clean.
-
-What am I missing here?
-
->
-> This is what Oracle does:
->
-> it uses dedicated DBWR process for writing/flushing modified
-> data pages and LGWR process for writing/flushing redo log
-> (redo log is transaction log also). LGWR always flushes log pages
-> when committing, but dirty data pages can be flushed _after_ transaction
-> commit when DBWR decides that it's time to do it (a la checkpoint interval).
->
-> Using redo log we could implement buffered logging quite easily.
-> We can even do without dedicated processes (but flush redo before pg_log),
-> though having LGWR could simplify things.
->
-> Without redo log or without some fsync synchronization we can't implement
-> buffered logging. BTW, shared system cache could help with
-> fsync synchronization, but, imho, redo is better (and faster for
-> un-buffered logging too).
->
-
-I suggested my solution because it is clean, does flushing in one
-central location (postmaster), and does quick restores.
-
-> > > > Actually, I found a problem with my description. Because pg_log is not
-> > > > fsync'ed, after a crash, pages with new transactions could have been
-> > > > flushed to disk, but not the pg_log table that contains the transaction
-> > > > ids. The problem is that the new backend could assign a transaction id
-> > > > that is already in use.
-> > >
-> > > Impossible. Backend flushes pg_variable after fetching next 32 xids.
-> >
-> > My suggestion is that we don't need to flush pg_variable or pg_log that
-> > much. My suggestion would speed up the test you do with 100 inserts
-> > inside a single transaction vs. 100 separate inserts.
-> >
-> > > >
-> > > > We could set a flag upon successful shutdown, and if it is not set on
-> > > > reboot, either do a vacuum to find the max transaction id, and
-> > > > invalidate all them not in pg_log as synced, or increase the next
-> > > > transaction id to some huge number and invalidate all them in between.
-> > > >
-> >
-> > I have a fix for the problem stated above, and it doesn't require a
-> > vacuum.
-> >
-> > We decide to fsync pg_variable and pg_log every 10,000 transactions or
-> > oids. Then if the database is brought up, and it was not brought down
-> > cleanly, you increment oid and transaction_id by 10,000, because you
-> > know you couldn't have gotten more than that. All intermediate
-> > transactions that are not marked committed/synced are marked aborted.
->
-> This is what I propose to do by placing the next available oid/xid
-> in shmem: this allows pre-fetching many more than 32 ids at once
-> without losing them when a session closes.
->
-> > The problem we have with the current system is that we sync by action,
-> > not by time interval. If you are doing tons of inserts or updates, it
-> > is syncing after every one. What people really want is something that
-> > will sync not after every action, but after every minute or five
-> > minutes, so when the system is busy, the syncing every minute is just a
-> > small amount, and when the system is idle, no one cares if it syncs, and
-> > no one has to wait for the sync to complete.
->
-> When I'm really doing tons of inserts/updates/deletes I use
-> BEGIN/END. But that doesn't work in a multi-user environment, of course.
-> As for what people really want, I remember that someone recently
-> said on the user list that if you want 10-20 inserts/sec then you
-> should use mysql, but I got 25 inserts/sec on AIC-7880 & WD Enterprise
-> when using one session, 32 inserts/sec with two sessions inserting
-> into two different tables, and only 20 inserts/sec with two sessions
-> inserting into the same table. Imho, this difference between 20 and 32
-> is the more important thing to fix, and these results are not so bad
-> in comparison with others.
->
-> (BTW, we shouldn't forget about using raw devices to speed up things).
->
-> Vadim
->
-
-
---
-Bruce Momjian
-
-
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id NAA06130
- for ; Wed, 5 Nov 1997 13:26:26 -0500 (EST)
- by animal.blarg.net (8.8.5/8.8.4) with SMTP
- id KAA09775; Wed, 5 Nov 1997 10:26:10 -0800
-Date: Wed, 5 Nov 1997 10:26:10 -0800 (PST)
-From: "James A. Hillyerd"
-To: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Status: OR
-
-On Wed, 5 Nov 1997, Bruce Momjian wrote:
->
-> The strange thing I am hearing is that the people who use PostgreSQL are
-> more worried about data recovery from a crash than million-dollar
-> companies that use commercial databases.
->
-
-If I may throw in my 2 cents, I'd prefer to see that database in a
-consistent state, with the data being up to date as of 1 minute or
-less before the crash. I'd rather have higher performance than up to the
-second data.
-
--james
-
-[ http://www.blarg.net/~james/ http://www.hyperglyphics.com/ ]
-[ 1024/B11C3751 CA 1C B3 A9 07 2F 57 C9 91 F4 73 F2 19 A4 C5 88 ]
-
-
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id OAA07830
- for ; Wed, 5 Nov 1997 14:24:02 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id OAA02778 for ; Wed, 5 Nov 1997 14:13:45 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id CAA23376; Thu, 6 Nov 1997 02:17:51 +0700 (KRS)
-Date: Thu, 06 Nov 1997 02:17:50 +0700
-From: "Vadim B. Mikheev"
-Organization: ITTS (Krasnoyarsk)
-X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
-MIME-Version: 1.0
-To: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Status: OR
-
-Bruce Momjian wrote:
->
-> > There is no fsync synchronization currently.
-> > How could we be ensured that all modified data pages are flushed
-> > when we decided to flush pg_log ?
-> > If backend doesn't fsync data pages & pg_log at the commit time
-> > then when he must flush them (data first) ?
->
-> My idea was to have the backend do a 'sync' that causes the OS to sync
-> all dirty pages, then mark all committed transactions on pg_log as
-> 'synced'. Then sync pg_log. That way, there is a clear system where we
-> know everything is flushed to disk, and we mark the transactions as
-> synced.
->
-> The only time that synced flag is used, is when the database starts up,
-> and it sees that the previous shutdown was not clean.
->
-> What am I missing here?
-
-Ok, I see. But we can avoid the 'synced' flag: just before sync-ing data
-pages, we can make in-memory copies of the "on-line" dirty pg_log pages
-that are to be written/fsynced, and perform the write/fsync from these
-copies without stopping new commits in the "on-line" page(s) (nothing
-must go to disk from the "on-line" log pages).
-
-Vadim
-
-Received: from hub.org (hub.org [209.47.148.200])
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id OAA08101
- for ; Wed, 5 Nov 1997 14:32:21 -0500 (EST)
-Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id OAA22970; Wed, 5 Nov 1997 14:26:47 -0500 (EST)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 05 Nov 1997 14:24:59 -0500 (EST)
-Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id OAA22344 for pgsql-hackers-outgoing; Wed, 5 Nov 1997 14:24:56 -0500 (EST)
-Received: from candle.pha.pa.us ([email protected] [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id OAA22319 for ; Wed, 5 Nov 1997 14:24:38 -0500 (EST)
-Received: (from maillist@localhost)
- by candle.pha.pa.us (8.8.5/8.8.5) id OAA07661;
- Wed, 5 Nov 1997 14:22:46 -0500 (EST)
-From: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Date: Wed, 5 Nov 1997 14:22:45 -0500 (EST)
-X-Mailer: ELM [version 2.4 PL25]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Status: OR
-
-Just a clarification. When I say the postmaster issues a sync, I mean
-sync(2), not fsync(2).
-
-The sync flushes all dirty pages on all file systems. Ordinary users
-can issue this, and update usually does this every 30 seconds anyway.
-
-By using this, we let the kernel figure out which buffers are dirty. We
-don't have to figure this out in the postmaster.
-
-Then we update the pg_log table to mark those transactions as synced.
-On recovery from a crash, we mark the committed transactions as
-uncommitted if they do not have the synced flag.
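The recovery rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `XactStatus`, `mark_synced` and `recover_after_crash` are hypothetical names, and a plain dict stands in for pg_log; none of this is PostgreSQL code.

```python
from dataclasses import dataclass


@dataclass
class XactStatus:
    """Hypothetical stand-in for one pg_log entry."""
    committed: bool = False
    synced: bool = False


def mark_synced(pg_log, committed_xids):
    """After sync(2) has returned, flag the given committed transactions as synced."""
    for xid in committed_xids:
        if pg_log[xid].committed:
            pg_log[xid].synced = True


def recover_after_crash(pg_log, clean_shutdown):
    """On an unclean start, clear the commit flag of every unsynced transaction.

    Returns the xids that were turned back into uncommitted ones.
    """
    if clean_shutdown:
        return []
    undone = []
    for xid, status in sorted(pg_log.items()):
        if status.committed and not status.synced:
            status.committed = False
            undone.append(xid)
    return undone
```

The synced flag is consulted only in `recover_after_crash`, which matches the point that running backends never need to look at it.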
-
---
-Bruce Momjian
-
-
-Received: from hub.org (hub.org [209.47.148.200])
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id PAA08751
- for ; Wed, 5 Nov 1997 15:10:59 -0500 (EST)
-Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id PAA01986; Wed, 5 Nov 1997 15:01:24 -0500 (EST)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 05 Nov 1997 14:59:32 -0500 (EST)
-Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id OAA01414 for pgsql-hackers-outgoing; Wed, 5 Nov 1997 14:59:28 -0500 (EST)
-Received: from candle.pha.pa.us ([email protected] [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id OAA01403 for ; Wed, 5 Nov 1997 14:59:14 -0500 (EST)
-Received: (from maillist@localhost)
- by candle.pha.pa.us (8.8.5/8.8.5) id OAA08283;
- Wed, 5 Nov 1997 14:53:55 -0500 (EST)
-From: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Date: Wed, 5 Nov 1997 14:53:54 -0500 (EST)
-X-Mailer: ELM [version 2.4 PL25]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Status: OR
-
-> > The only time that synced flag is used, is when the database starts up,
-> > and it sees that the previous shutdown was not clean.
-> >
-> > What am I missing here?
->
-> Ok, I see. But we can avoid the 'synced' flag: just before sync-ing data
-> pages, we can make in-memory copies of the "on-line" dirty pg_log pages
-> that are to be written/fsynced, and perform the write/fsync from these
-> copies without stopping new commits in the "on-line" page(s) (nothing
-> must go to disk from the "on-line" log pages).
-
-[Working late tonight?]
-
-OK, now I am lost. We need the sync'ed flag so when we start the
-postmaster, and we see the database was not shut down properly, we use
-the flag to clear the commit flag from committed transactions that were
-not sync'ed by the postmaster.
-
-In my opinion, we don't need any extra copies of pg_log, we can set
-those sync'ed flags while others are making changes, because before we
-did our sync, we gathered a list of committed transaction ids from the
-shared transaction id queue that I mentioned a while ago.
-
-We need this queue so we can find the newly-committed transactions that
-do not have a sync flag. Another way we could do this would be to scan
-pg_log before we sync, getting all the committed transaction ids without
-sync flags. No lock is needed on the table. If we miss some new ones,
-we will get them next time we scan. The problem I saw is that there is
-no way to see when to stop scanning the pg_log table for such
-transactions, so I thought each backend would have to put its newly
-committed transactions in a separate place. Maybe I am wrong.
-
-This syncing method just seems so natural since we have pg_log. That is
-why I keep bringing it up until people tell me I am stupid.
-
-This transaction commit/sync stuff is complicated, and takes a while to
-hash out in a group.
-
----------------------------------------------------------------------------
-
-I just re-read your description, and I see what you are saying. My idea
-has the pg_log commit flags be real commit flags while the system is running,
-but on reboot after failure, we remove the commit flags on non-synced
-stuff before we start up.
-
-Your idea is to make pg_log commit flags only appear in in-memory copies
-of pg_log, and write the commit flags to disk only after the sync is
-done.
-
-Either way will work. The question is, "Which is easier?" The OS is
-going to sync pg_log on its own. We would almost need a second copy of
-pg_log, one copy to be used on postmaster startup, and a second to be
-used by running backends, and the postmaster would make a copy of the
-running backend pg_log, sync the disks, and copy it to the boot copy.
-
-I don't see how the backend is going to figure out which pg_log pages
-were modified and need to be sent to the boot copy of pg_log.
-
-Now that I am thinking, here is a good idea. Instead of a fancy
-transaction queue, what if we just have the backend record the lowest
-numbered transaction they commit in a shared memory area. If the
-current transaction id they commit is greater than the minimum, then
-change nothing. That way, the backend could copy all pg_log pages
-containing that minimum pg_log transaction id up to the most recent
-pg_log page, do the sync, and copy just those to the boot copy of
-pg_log.
-
-This eliminates the transaction id queue.
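A rough Python sketch of the "lowest committed xid" idea, to make the page arithmetic concrete. The class name, its methods and the `XIDS_PER_PAGE` value are assumptions for illustration (the value assumes an 8K page with two status bits per xid); a real shared-memory slot would also need a lock around it.

```python
XIDS_PER_PAGE = 32 * 1024  # assumption: 8K pg_log page, 2 status bits per xid


class MinCommittedXid:
    """Hypothetical shared slot holding the lowest xid committed since the last sync."""

    def __init__(self):
        self.min_xid = None

    def record_commit(self, xid):
        # Only ever lower the minimum; a larger xid changes nothing.
        if self.min_xid is None or xid < self.min_xid:
            self.min_xid = xid

    def pages_to_copy(self, latest_xid):
        """Page numbers from the page holding the minimum xid up to the newest page."""
        if self.min_xid is None:
            return []
        first = self.min_xid // XIDS_PER_PAGE
        last = latest_xid // XIDS_PER_PAGE
        return list(range(first, last + 1))

    def reset(self):
        # Called once the copied pages have reached the boot copy of pg_log.
        self.min_xid = None
```

Because only a single integer is tracked, the cost per commit is one comparison, at the price of copying whole pages that may contain already-synced entries.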
-
-The nice thing about the sync-flag in pg_log is that there is no copying
-by the backend. But we would have to spin through the file to set those
-sync bits. Your method just copies whole pages to the boot copy.
-
----------------------------------------------------------------------------
-
-I don't want to force this idea on anyone, or annoy anyone. I just
-think it needs to be considered. The concepts are unusual, so once
-people get the full idea, if they don't like it, we can trash it. I
-still think it holds promise.
-
---
-Bruce Momjian
-
-
-Received: from hotzsun.jpl.nasa.gov (hotzsun.jpl.nasa.gov [137.79.51.138])
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id PAA09500
- for ; Wed, 5 Nov 1997 15:30:16 -0500 (EST)
-Received: from [137.79.51.141] (hotzmac [137.79.51.141]) by hotzsun.jpl.nasa.gov (8.7.6/8.7.3) with SMTP id MAA10100; Wed, 5 Nov 1997 12:29:58 -0800 (PST)
-Message-Id:
-Mime-Version: 1.0
-Content-Type: text/plain; charset="us-ascii"
-Date: Wed, 5 Nov 1997 12:29:58 -0800
-To: Bruce Momjian ,
-Subject: Re: [HACKERS] My $.02, was: PERFORMANCE and Good Bye, Time Travel!
-Status: OR
-
-At 11:15 AM 11/5/97, Bruce Momjian wrote:
->The strange thing I am hearing is that the people who use PostgreSQL are
->more worried about data recovery from a crash than million-dollar
->companies that use commercial databases.
->
->I don't get it.
-
-I would run PG to make sure that committed transactions were really written
-to disk because that seems "correct" and I don't have the kind of
-performance requirements that would push me to do otherwise.
-
-That said, I can see a need for varying performance/crash-immunity
-tradeoffs, and at least *one* option in between "correct" and "unprotected"
-operation would seem desirable.
-
-Signature failed Preliminary Design Review.
-Feasibility of a new signature is currently being evaluated.
-
-
-
-Received: from hub.org (hub.org [209.47.148.200])
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id PAA04634
- for ; Thu, 6 Nov 1997 15:51:08 -0500 (EST)
-Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id PAA24783; Thu, 6 Nov 1997 15:36:47 -0500 (EST)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Thu, 06 Nov 1997 15:36:07 -0500 (EST)
-Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id PAA24514 for pgsql-hackers-outgoing; Thu, 6 Nov 1997 15:36:02 -0500 (EST)
-Received: from guevara.bildbasen.kiruna.se (guevara.bildbasen.kiruna.se [193.45.225.110]) by hub.org (8.8.5/8.7.5) with SMTP id PAA24319 for ; Thu, 6 Nov 1997 15:35:32 -0500 (EST)
-Received: (qmail 9764 invoked by uid 129); 6 Nov 1997 20:34:35 -0000
-Date: 6 Nov 1997 20:34:35 -0000
-From: Goran Thyni
- Thu, 06 Nov 1997 11:40:30 +0100)
-Subject: [HACKERS] Re: Performance vs. Crash Recovery
-Mime-Version: 1.0
-Content-Type: text/plain; charset=ISO-8859-1
-Precedence: bulk
-Status: OR
-
-
-I am getting quite bored by this discussion.
-If someone has a strong opinion about how this
-should be done, go ahead and make a test implementation;
-then we have something to discuss.
-
-In the meantime, if you want the best possible data protection,
-mount your database disk sync:ed. This is safer than any scheme
-we could come up with.
-D*mned slow too, so everybody should be happy. :-)
-
-And I see no point in implementing a periodic sync in postmaster.
-All unices have cron, so why not just use that?
-Or even a stupid 1-liner (ba)sh-script like:
-
-while true; do sleep 20; sync; done
-
- best regards,
---
----------------------------------------------
-Göran Thyni, sysadm, JMS Bildbasen, Kiruna
-
-
-
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86])
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id XAA04723
- for ; Thu, 6 Nov 1997 23:31:21 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id LAA25438; Fri, 7 Nov 1997 11:36:25 +0700 (KRS)
-Date: Fri, 07 Nov 1997 11:36:25 +0700
-From: "Vadim B. Mikheev"
-Organization: ITTS (Krasnoyarsk)
-X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
-MIME-Version: 1.0
-To: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Status: OR
-
-Bruce Momjian wrote:
->
-> > > The only time that synced flag is used, is when the database starts up,
-> > > and it sees that the previous shutdown was not clean.
-> > >
-> > > What am I missing here?
-> >
-> > Ok, I see. But we can avoid the 'synced' flag: just before sync-ing data
-> > pages, we can make in-memory copies of the "on-line" dirty pg_log pages
-> > that are to be written/fsynced, and perform the write/fsync from these
-> > copies without stopping new commits in the "on-line" page(s) (nothing
-> > must go to disk from the "on-line" log pages).
->
-> [Working late tonight?]
-
-[Yes]
-
-> I just re-read your description, and I see what you are saying. My idea
-> has the pg_log commit flags be real commit flags while the system is running,
-> but on reboot after failure, we remove the commit flags on non-synced
-> stuff before we start up.
->
-> Your idea is to make pg_log commit flags only appear in in-memory copies
-> of pg_log, and write the commit flags to disk only after the sync is
-> done.
->
-> Either way will work. The question is, "Which is easier?" The OS is
-> going to sync pg_log on its own. We would almost need a second copy of
-> pg_log, one copy to be used on postmaster startup, and a second to be
-> used by running backends, and the postmaster would make a copy of the
-> running backend pg_log, sync the disks, and copy it to the boot copy.
->
-> I don't see how the backend is going to figure out which pg_log pages
-> were modified and need to be sent to the boot copy of pg_log.
->
-> Now that I am thinking, here is a good idea. Instead of a fancy
-> transaction queue, what if we just have the backend record the lowest
-> numbered transaction they commit in a shared memory area. If the
-> current transaction id they commit is greater than the minimum, then
-> change nothing. That way, the backend could copy all pg_log pages
-> containing that minimum pg_log transaction id up to the most recent
-> pg_log page, do the sync, and copy just those to the boot copy of
-> pg_log.
->
-> This eliminates the transaction id queue.
->
-> The nice thing about the sync-flag in pg_log is that there is no copying
-> by the backend. But we would have to spin through the file to set those
-> sync bits. Your method just copies whole pages to the boot copy.
-
- In my plans to redesign the transaction system I intended to keep the last
-two pg_log pages in shmem. They are the most often used, and using
-ReadBuffer/WriteBuffer to access them is not a good idea. Also, we could use a
-spinlock instead of the lock manager to synchronize access to these pages (as
-I see in spin.c, spinlocks could be shared, but only exclusive ones are used);
-spinlocks are faster.
- These last two pg_log pages are the "online" ones. Race condition: one or
-both of the online pages become non-online, i.e. pg_log has to be expanded
-when writing the commit/abort of a "big" xid. This is how we could handle this
-in "buffered" logging (delayed fsync) mode:
-
- When a backend wants to write commit/abort status, it acquires the exclusive
-OnLineLogLock. If the xid belongs to the online pages, the backend writes the
-status and releases the spinlock. If the xid is less than the least xid on the
-1st online page, the backend releases the spinlock and does exactly what it
-does in normal mode: flush (write and fsync) all dirty data files, lock pg_log
-for write, ReadBuffer, update xid status, WriteBuffer, release write lock,
-flush pg_log.
-If the xid is greater than the max xid on the 2nd online page, the simplest way
-is to do sync(); sync() (two times), flush the 1st or both online pages,
-read the new page(s) into the online page space, update the xid status, and
-release the OnLineLogLock spinlock. We could try other ways, but pg_log
-expansion is a rare case (32K xids in one pg_log page)...
- All the postmaster will have to do is:
-1. Get shared OnLineLogLock.
-2. Copy 2 x 8K data to private place.
-3. Release spinlock.
-4. sync(); sync(); (two times!)
-5. Flush online pages.
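The five postmaster steps can be sketched as follows. This is a minimal Python illustration, not backend code: `OnlineLogPages` and `delayed_flush` are hypothetical names, a `threading.Lock` stands in for the shared OnLineLogLock spinlock (it is exclusive, which is a simplification), and the `write_page` callback is assumed to write and fsync one page of pg_log.

```python
import os
import threading

PAGE_SIZE = 8192  # the "2 x 8K" from the steps above


class OnlineLogPages:
    """Hypothetical holder for the two online pg_log pages kept in shared memory."""

    def __init__(self):
        self.lock = threading.Lock()  # stand-in for OnLineLogLock
        self.pages = [bytearray(PAGE_SIZE), bytearray(PAGE_SIZE)]

    def delayed_flush(self, write_page, sync=None):
        """Snapshot the online pages, sync data, then flush the snapshot."""
        if sync is None:
            sync = os.sync  # sync(2); available on Unix
        # Steps 1-3: take the lock, copy both pages to a private place, release.
        with self.lock:
            private = [bytes(page) for page in self.pages]
        # Step 4: sync twice, as in the steps above.
        sync()
        sync()
        # Step 5: flush the private copies, never the live online pages.
        for number, page in enumerate(private):
            write_page(number, page)
        return private
```

Commits that arrive after the lock is released land only in the live pages, so they are not written to disk until the next round.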
-
-We could use -F DELAY_TIME to turn fsync delayed mode ON.
-
-And, btw, with two bits for xact status we currently have only one
-unused status value (binary 11) - I would like to use this for
-nested xactions and savepoints...
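To make the two-bit arithmetic concrete, here is a small Python sketch of packing statuses into an 8K page. The bit order within a byte and the helper names `set_status`/`get_status` are assumptions for illustration; only the page size and the two bits per transaction come from the discussion.

```python
PAGE_SIZE = 8192
BITS_PER_XACT = 2
XACTS_PER_BYTE = 8 // BITS_PER_XACT          # 4 statuses per byte
XACTS_PER_PAGE = PAGE_SIZE * XACTS_PER_BYTE  # 32768, i.e. 32K xids per page


def set_status(page, xid, status):
    """Store a two-bit status (0..3) for xid in a bytearray page."""
    if not 0 <= status <= 3:
        raise ValueError("status must fit in two bits")
    slot = xid % XACTS_PER_PAGE
    byte, position = divmod(slot, XACTS_PER_BYTE)
    shift = position * BITS_PER_XACT
    cleared = page[byte] & ~(0b11 << shift) & 0xFF
    page[byte] = cleared | (status << shift)


def get_status(page, xid):
    """Read back the two-bit status stored for xid."""
    slot = xid % XACTS_PER_PAGE
    byte, position = divmod(slot, XACTS_PER_BYTE)
    return (page[byte] >> (position * BITS_PER_XACT)) & 0b11
```

Two bits give exactly four values per transaction, which is why only one value is left over once three are in use.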
-
-> I don't want to force this idea on anyone, or annoy anyone. I just
-> think it needs to be considered. The concepts are unusual, so once
-> people get the full idea, if they don't like it, we can trash it. I
-> still think it holds promise.
-
-Agreed.
-
-Vadim
-
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id BAA07651
- for ; Fri, 7 Nov 1997 01:32:47 -0500 (EST)
-Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id XAA23328 for ; Thu, 6 Nov 1997 23:46:08 -0500 (EST)
-Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id XAA19565; Thu, 6 Nov 1997 23:38:55 -0500 (EST)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Thu, 06 Nov 1997 23:36:53 -0500 (EST)
-Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id XAA18911 for pgsql-hackers-outgoing; Thu, 6 Nov 1997 23:36:44 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by hub.org (8.8.5/8.7.5) with ESMTP id XAA18779 for ; Thu, 6 Nov 1997 23:36:02 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id LAA25448; Fri, 7 Nov 1997 11:40:29 +0700 (KRS)
-Date: Fri, 07 Nov 1997 11:40:29 +0700
-From: "Vadim B. Mikheev"
-Organization: ITTS (Krasnoyarsk)
-X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
-MIME-Version: 1.0
-To: Bruce Momjian
-Subject: Re: Sync:ing data and log (Was: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!)
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Status: OR
-
-Bruce Momjian wrote:
->
-> >
-> > Never use sync(). Use fsync(). Other processes should take care of their
-> > own syncing. If you use sync(), and you have a lot of disks, the sync
-> > can
-> > take half a minute if you are unlucky.
->
-> We could use fsync() but then the postmaster has to know what tables
-> have dirty buffers, and I don't think there is an easy way to do this.
-
-There is one way - shared system cache...
-
-Vadim
-
-
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id BAA07639
- for ; Fri, 7 Nov 1997 01:31:22 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id XAA23094 for ; Thu, 6 Nov 1997 23:39:00 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id LAA25457; Fri, 7 Nov 1997 11:43:52 +0700 (KRS)
-Date: Fri, 07 Nov 1997 11:43:51 +0700
-From: "Vadim B. Mikheev"
-Organization: ITTS (Krasnoyarsk)
-X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
-MIME-Version: 1.0
-To: Mattias Kregert
-Subject: Re: Performance vs. Crash Recovery (Was: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!)
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Status: OR
-
-Mattias Kregert wrote:
->
-> > The strange thing I am hearing is that the people who use PostgreSQL are
-> > more worried about data recovery from a crash than million-dollar
-> > companies that use commercial databases.
-> >
-> > I don't get it.
->
-> Perhaps the million-dollar companies have more sophisticated hardware,
-> like big expensive disk arrays, big UPSes and parallel backup
-> servers?
-> If so, the risk of hardware failure is much smaller for them.
-
-More than that - Informix is more stable than postgres: elog(FATAL)
-occurs sometimes, and in fsync delayed mode that will lose
-xactions too, not only a hardware/OS failure.
-
-Vadim
-
-Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id AAA24358 for ; Fri, 7 Nov 1997 00:09:47 -0500 (EST)
-Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id AAA00167; Fri, 7 Nov 1997 00:03:17 -0500 (EST)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Fri, 07 Nov 1997 00:01:26 -0500 (EST)
-Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id AAA29427 for pgsql-hackers-outgoing; Fri, 7 Nov 1997 00:01:19 -0500 (EST)
-Received: from candle.pha.pa.us ([email protected] [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id AAA29364 for ; Fri, 7 Nov 1997 00:01:02 -0500 (EST)
-Received: (from maillist@localhost)
- by candle.pha.pa.us (8.8.5/8.8.5) id XAA05565;
- Thu, 6 Nov 1997 23:54:33 -0500 (EST)
-From: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Date: Thu, 6 Nov 1997 23:54:33 -0500 (EST)
-X-Mailer: ELM [version 2.4 PL25]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Status: OR
-
-I was worried when you didn't respond to my last list of ideas. I
-thought perhaps the idea was getting on your nerves.
-
-I haven't dropped the idea because:
-
- 1) it offers 2-9 times speedup in database modifications
- 2) this is how the big commercial systems handle it, and I think
- we need to give users this option.
- 3) in the way I had it designed, it wouldn't take much work to
- do it.
-
-Anything that promises that much speedup, if it can be done easily, I say
-let's consider it, even if you lose 60 seconds of changes.
-
-
-> In my plans to redesign the transaction system I intended to keep the last
-> two pg_log pages in shmem. They are the most often used, and using
-> ReadBuffer/WriteBuffer to access them is not a good idea. Also, we could use
-> a spinlock instead of the lock manager to synchronize access to these pages
-> (as I see in spin.c, spinlocks could be shared, but only exclusive ones are
-> used) - spinlocks are faster.
-
-Ah, so you already had the idea of having on-line pages in shared memory
-as part of a transaction system overhaul? Right now, does each backend
-lock/read/write/unlock to get at pg_log? Wow, that is bad.
-
-Perhaps mmap() would be a good idea. My system has msync() to flush
-mmap()'ed pages to the underlying file. You would still run fsync()
-after that. This may give us the best of both worlds: a shared-memory
-area of variable size, and control of when it gets flushed to disk. Do
-other OSes have this? I have a feeling OSes with unified buffer caches
-don't have this ability to determine when the mmap'ed pages
-get sent to the underlying file and disk.
-
-
-> These last two pg_log pages are the "online" ones. Race condition: one or
-> both of the online pages can become non-online, i.e. pg_log has to be expanded
-> when writing commit/abort of a "big" xid. This is how we could handle this
-> in "buffered" logging (delayed fsync) mode:
->
-> When a backend wants to write commit/abort status, it acquires the exclusive
-> OnLineLogLock. If the xid belongs to the online pages, the backend writes the
-> status and releases the spinlock. If the xid is less than the least xid on the
-> 1st online page, the backend releases the spinlock and does exactly what it
-> does in normal mode: flush (write and fsync) all dirty data files, lock pg_log
-> for write, ReadBuffer, update xid status, WriteBuffer, release write lock,
-> flush pg_log.
-> If the xid is greater than the max xid on the 2nd online page, the simplest way
-> is to just do sync(); sync() (two times), flush the 1st or both online pages,
-> read the new page(s) into the online pages space, update xid status, and
-> release the OnLineLogLock spinlock. We could try other ways, but pg_log
-> expansion is a rare case (32K xids in one pg_log page)...
-> All the postmaster will have to do is:
-> 1. Get shared OnLineLogLock.
-> 2. Copy 2 x 8K data to private place.
-> 3. Release spinlock.
-> 4. sync(); sync(); (two times!)
-> 5. Flush online pages.
->
-> We could use -F DELAY_TIME to turn fsync delayed mode ON.
->
-> And, btw, having two bits for xact status we have only one unused
-> status value (0x11) currently - I would like to use this for
-> nested xactions and savepoints...
-
-I saw that. By keeping two copies of pg_log, one in memory to be used
-by all backends, and another that hits the disk, it certainly will work.
-
->
-> > I don't want to force this idea on anyone, or annoy anyone. I just
-> > think it needs to be considered. The concepts are unusual, so once
-> > people get the full idea, if they don't like it, we can trash it. I
-> > still think it holds promise.
->
-> Agreed.
->
-> Vadim
->
-
-
---
-Bruce Momjian
-
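-A minimal sketch of the three-way decision in the commit/abort protocol quoted
-above: given the lowest xid on the 1st online page, decide which path a
-backend takes for a given xid. The enum and function names are hypothetical,
-xid wraparound is ignored, and locking is left out; only the page arithmetic
-(32K xids per 8K page, two online pages) comes from the thread.

```c
#include <assert.h>
#include <stdint.h>

#define XIDS_PER_PAGE 32768u   /* 8K page with two status bits per xid */

enum commit_path {
    PATH_ONLINE,      /* xid is on one of the two online pages: update in shmem */
    PATH_OLD_PAGE,    /* xid precedes the 1st online page: normal flush path */
    PATH_EXPAND_LOG   /* xid is past the 2nd online page: sync, flush, load new pages */
};

/*
 * first_online_xid is the lowest xid stored on the 1st online page
 * (an assumed bookkeeping value; it must be page-aligned).
 */
static enum commit_path choose_commit_path(uint32_t first_online_xid, uint32_t xid)
{
    assert(first_online_xid % XIDS_PER_PAGE == 0);
    if (xid < first_online_xid)
        return PATH_OLD_PAGE;
    if (xid - first_online_xid < 2u * XIDS_PER_PAGE)
        return PATH_ONLINE;
    return PATH_EXPAND_LOG;
}
```

-With first_online_xid = 65536, xids 65536 through 131071 stay on the fast
-in-memory path; anything lower takes the normal flush path, and anything
-higher forces the rare pg_log expansion.
-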
-
-Received: from hub.org (hub.org [209.47.148.200])
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id BAA07314
- for ; Fri, 7 Nov 1997 01:03:05 -0500 (EST)
-Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id AAA07879; Fri, 7 Nov 1997 00:57:42 -0500 (EST)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Fri, 07 Nov 1997 00:55:52 -0500 (EST)
-Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id AAA03918 for pgsql-hackers-outgoing; Fri, 7 Nov 1997 00:55:46 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by hub.org (8.8.5/8.7.5) with ESMTP id AAA02961 for ; Fri, 7 Nov 1997 00:55:18 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id MAA25567; Fri, 7 Nov 1997 12:59:29 +0700 (KRS)
-Date: Fri, 07 Nov 1997 12:59:28 +0700
-From: "Vadim B. Mikheev"
-Organization: ITTS (Krasnoyarsk)
-X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
-MIME-Version: 1.0
-To: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Status: OR
-
-Bruce Momjian wrote:
->
-> I was worried when you didn't respond to my last list of ideas. I
-> thought perhaps the idea was getting on your nerves.
-
-No, I was (and, unfortunately, still am) busy...
-
->
-> I haven't dropped the idea because:
->
-> 1) it offers 2-9 times speedup in database modifications
-> 2) this is how the big commercial system handle it, and I think
-> we need to give users this option.
-> 3) in the way I had it designed, it wouldn't take much work to
-> do it.
->
-> Anything that promises that much speedup, if it can be done easy, I say
-> lets consider it, even if you loose 60 seconds of changes.
-
-I agree with your un-buffered logging idea. This would be an excellent
-feature for non-critical database uses (WWW, etc).
-
->
-> > In my plans to re-design transaction system I supposed to keep in shmem
-> > two last pg_log pages. They are most often used and using ReadBuffer/WriteBuffer
-> > to access them is not good idea. Also, we could use spinlock instead of
-> > lock manager to synchronize access to these pages (as I see in spin.c
-> > spinlock-s could be shared, but only exclusive ones are used) - spinlocks
-> > are faster.
->
-> Ah, so you already had the idea of having on-line pages in shared memory
-> as part of a transaction system overhaul? Right now, does each backend
-
-Yes. I hope to implement this in the next 1-2 weeks.
-
-> lock/read/write/unlock to get at pg_log? Wow, that is bad.
-
-Yes, he does.
-
->
-> Perhaps mmap() would be a good idea. My system has msync() to flush
-> mmap()'ed pages to the underlying file. You would still run fsync()
-> after that. This may give us the best of both worlds: a shared-memory
-                                                          ^^^^^^^^^^^^^
-> area of variable size, and control of when it get flushed to disk. Do
-  ^^^^^^^^^^^^^^^^^^^^^
-I like it. FreeBSD supports
-
-MAP_ANON Map anonymous memory not associated with any specific file.
-
-It would be nice to use mmap to get more "shared" memory, but I don't see
-any reason to mmap a particular file into memory. Keeping the last two pg_log
-pages in memory, plus the xact commit/abort writeback optimization (any scan
-updates the commit/abort xmin/xmax status in tuples - we already have this),
-reduces access to "old" pg_log pages to zero.
-
-> other OS's have this? I have a feeling OS's with unified buffer caches
-> don't have this ability to determine when the underlying mmap'ed file
-> gets sent to the underlying file and disk.
->
-> > These two last pg_log pages are "online" ones. Race condition: when one or
-> > both of online pages becomes non-online ones, i.e. pg_log has to be expanded
-> > when writing commit/abort of "big" xid. This is how we could handle this
-> > in "buffered" logging (delayed fsync) mode:
-> >
-> > When backend want to write commit/abort status he acquires exclusive
-> > OnLineLogLock. If xid belongs to online pages then backend writes status
-> > and releases spin. If xid is less than least xid on 1st online page then
-> > backend releases spin and does exactly the same what he does in normal mode:
-> > flush (write and fsync) all durty data files, lock pg_log for write, ReadBuffer,
-> > update xid status, WriteBuffer, release write lock, flush pg_log.
-> > If xid is greater than max xid on 2nd online page then the simplest way is
-> > just do sync(); sync() (two times), flush 1st or both online pages,
-> > read new page(s) into online pages space, update xid status,
-> > release OnLineLogLock spin. We could try other ways but pg_log expanding
-> > is rare case (32K xids in one pg_log page)...
-> > All what postmaster will have to do is:
-> > 1. Get shared OnLineLogLock.
-> > 2. Copy 2 x 8K data to private place.
-> > 3. Release spinlock.
-> > 4. sync(); sync(); (two times!)
-> > 5. Flush online pages.
-> >
-> > We could use -F DELAY_TIME to turn fsync delayed mode ON.
-> >
-> > And, btw, having two bits for xact status we have only one unused
-> > status value (0x11) currently - I would like to use this for
-> > nested xactions and savepoints...
-    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-More about this: 0x11 could mean "this _child_ transaction is committed -
-you have to look up pg_xact_child to get the parent xid and use pg_log again
-to get the parent xact status". If the parent committed then the child xact
-status will be changed to 0x10 (committed), else to 0x01 (aborted). Using this
-we could get xact nesting and savepoints by starting a new child xaction
-inside a running one...
-
->
-> I saw that. By keeping two copies of pg_log, one in memory to be used
- ^^^^^^
- Just two pg_log pages...
-
-> by all backend, and another that hits the disk, it certainly will work.
-
-Vadim
-
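-A minimal sketch of the two-bit status encoding and the proposed "child
-committed" value. It assumes the values written as 0x10, 0x01 and 0x11 in the
-message are the two-bit patterns 10, 01 and 11, since a two-bit field cannot
-hold hexadecimal 0x11. All names are hypothetical, and the parent_of array is
-only a stand-in for the proposed pg_xact_child lookup. It also shows where
-the "32K xids in one pg_log page" figure comes from: 8192 bytes times four
-xids per byte. One choice here goes beyond the message: a child whose parent
-is still in progress is left unchanged rather than marked aborted.

```c
#include <assert.h>
#include <stdint.h>

#define LOG_PAGE_BYTES 8192                              /* one pg_log page */
#define XIDS_PER_BYTE  4                                 /* 8 bits / 2 bits per xid */
#define XIDS_PER_PAGE  (LOG_PAGE_BYTES * XIDS_PER_BYTE)  /* 32768 */

/* Hypothetical names; the values are the two-bit patterns from the message. */
enum xact_status {
    XS_IN_PROGRESS     = 0,  /* 00 */
    XS_ABORTED         = 1,  /* 01 */
    XS_COMMITTED       = 2,  /* 10 */
    XS_CHILD_COMMITTED = 3   /* 11: proposed; final outcome depends on the parent */
};

/* Read the two status bits for one slot of a page. */
static enum xact_status get_status(const uint8_t *page, uint32_t slot)
{
    assert(slot < XIDS_PER_PAGE);
    unsigned shift = (slot % XIDS_PER_BYTE) * 2;
    return (enum xact_status)((page[slot / XIDS_PER_BYTE] >> shift) & 3u);
}

/* Overwrite the two status bits for one slot, leaving its neighbours alone. */
static void set_status(uint8_t *page, uint32_t slot, enum xact_status st)
{
    assert(slot < XIDS_PER_PAGE);
    unsigned shift = (slot % XIDS_PER_BYTE) * 2;
    uint8_t *b = &page[slot / XIDS_PER_BYTE];
    *b = (uint8_t)((*b & ~(3u << shift)) | ((unsigned)st << shift));
}

/*
 * Resolve a slot to its effective status. A "child committed" slot takes the
 * status of its parent; once the parent has finished, the child's own bits
 * are rewritten so later lookups need no second step.
 */
static enum xact_status resolve(uint8_t *page, const uint32_t *parent_of, uint32_t slot)
{
    enum xact_status st = get_status(page, slot);
    if (st != XS_CHILD_COMMITTED)
        return st;
    enum xact_status parent = resolve(page, parent_of, parent_of[slot]);
    if (parent == XS_COMMITTED || parent == XS_ABORTED)
        set_status(page, slot, parent);
    return parent;
}
```

-Because four xids share one byte, set_status() is a read-modify-write of that
-byte, which is why two backends updating neighbouring xids need some form of
-mutual exclusion.
-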
-
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id BAA26793 for ; Fri, 7 Nov 1997 01:12:33 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id NAA25592; Fri, 7 Nov 1997 13:16:39 +0700 (KRS)
-Date: Fri, 07 Nov 1997 13:16:39 +0700
-From: "Vadim B. Mikheev"
-Organization: ITTS (Krasnoyarsk)
-X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
-MIME-Version: 1.0
-To: Jan Wieck
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-References:
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Status: OR
-
->
-> Bruce wrote:
-> >
-> > > > It seems that this is what Oracle does, but Sybase writes queries
-> > > > (with transaction ids, of 'course, and before execution) and
-> > > > begin, commit/abort events <-- this is better for non-overwriting
-> > > > system (shorter redo file), but, agreed, recovering is more complicated.
-> > > >
-> > > > Vadim
-> > > >
-> > >
-> > > Writing only the queries (and only those that really modify
-> > > data - no selects) would be much smarter and the redo files
-> > > will be shorter. But it wouldn't fit for PostgreSQL as long
-> > > as someone can submit a query like
-> > >
-> > > DELETE FROM xxx WHERE oid = 59337;
-> >
-> > Interesting point. Currently, an insert shows the OID as output in
-> > psql. Perhaps we could do a little oid-manipulating to set the oid of
-> > the insert.
->
-> Only for simple inserts, not on
->
-> INSERT INTO xxx SELECT any_type_of_merge_join;
-
-I don't know how, but Sybase handles this, and IDENTITY (the same case as OIDs) too.
-I'm not objecting, Jan; it's just that I haven't time to do a
-"log queries" redo implementation, so I would like to have "log changes"
-redo at least. (Actually, "log changes" is good for my production database
-with 1 - 2 thousand updates per day.)
-(BTW, "incremental" backup could be implemented without redo - I have
-some thoughts about this - but having additional recovery is good
-in any case.)
-
-Vadim
-
-Received: from hub.org (hub.org [209.47.148.200])
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id PAA22341
- for ; Fri, 7 Nov 1997 15:42:55 -0500 (EST)
-Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id PAA02769; Fri, 7 Nov 1997 15:28:54 -0500 (EST)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Fri, 07 Nov 1997 15:24:00 -0500 (EST)
-Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id PAA01318 for pgsql-hackers-outgoing; Fri, 7 Nov 1997 15:23:52 -0500 (EST)
-Received: from candle.pha.pa.us ([email protected] [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id PAA00705 for ; Fri, 7 Nov 1997 15:21:56 -0500 (EST)
-Received: (from maillist@localhost)
- by candle.pha.pa.us (8.8.5/8.8.5) id PAA20010;
- Fri, 7 Nov 1997 15:20:10 -0500 (EST)
-From: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Date: Fri, 7 Nov 1997 15:20:10 -0500 (EST)
-X-Mailer: ELM [version 2.4 PL25]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Status: OR
-
-> > Anything that promises that much speedup, if it can be done easy, I say
-> > lets consider it, even if you loose 60 seconds of changes.
->
-> I agreed with your un-buffered logging idea. This would be excellent
-> feature for un-critical dbase usings (WWW, etc).
-
-Actually, it is buffered logging. We currently have unbuffered logging,
-I think.
-
-> > > In my plans to re-design transaction system I supposed to keep in shmem
-> > > two last pg_log pages. They are most often used and using ReadBuffer/WriteBuffer
-> > > to access them is not good idea. Also, we could use spinlock instead of
-> > > lock manager to synchronize access to these pages (as I see in spin.c
-> > > spinlock-s could be shared, but only exclusive ones are used) - spinlocks
-> > > are faster.
-> >
-> > Ah, so you already had the idea of having on-line pages in shared memory
-> > as part of a transaction system overhaul? Right now, does each backend
->
-> Yes. I hope to implement this in the next 1-2 weeks.
->
-> > lock/read/write/unlock to get at pg_log? Wow, that is bad.
->
-> Yes, he does.
->
-> >
-> > Perhaps mmap() would be a good idea. My system has msync() to flush
-> > mmap()'ed pages to the underlying file. You would still run fsync()
-> > after that. This may give us the best of both worlds: a shared-memory
->                                                           ^^^^^^^^^^^^^
-> > area of variable size, and control of when it get flushed to disk. Do
->   ^^^^^^^^^^^^^^^^^^^^^
-> I like it. FreeBSD supports
->
-> MAP_ANON Map anonymous memory not associated with any specific file.
->
-> It would be nice to use mmap to get more "shared" memory, but I don't see
-> reasons to mmap any particular file to memory. Having two last pg_log pages
-> in memory + xact commit/abort writeback optimization (updation of commit/abort
-> xmin/xmax status in tuples by any scan - we already have this) reduce access
-> to "old" pg_log pages to zero.
-
-I totally agree. There is no advantage to mmap() vs. shared memory for
-us. I thought if we could control when the mmap() gets flushed to disk,
-we could let the OS handle the syncing, but I doubt this is going to be
-portable.
-
-Though, we could mmap() pg_log, and that way backends would not have to
-read/write the blocks, and they could all see the same data. But with
-the new scheme, they have most transaction ids in shared memory.
-
-Interesting you mention the scan updating the transaction status. We
-would have a problem here. It is possible a backend will update the
-commit status of a data page, and that data page will make it to disk,
-but if there is a crash before the updated pg_log gets sync'ed, there
-would be a partial transaction in the system.
-
-I don't know any way that a backend would know the transaction has hit
-disk, and that the data commit flag could be set. You don't want to update
-the commit flag of the data page until the entire transaction has been
-sync'ed. The only way to do that would be to have a 'commit and synced'
-flag, but you want to save that for nested transactions.
-
-Another case where this could come in handy is to allow reuse of superseded
-data rows. If the transaction is committed and synced, the row space
-could be reused by another transaction.
-
-> > other OS's have this? I have a feeling OS's with unified buffer caches
-> > don't have this ability to determine when the underlying mmap'ed file
-> > gets sent to the underlying file and disk.
-> >
-> > > These two last pg_log pages are "online" ones. Race condition: when one or
-> > > both of online pages becomes non-online ones, i.e. pg_log has to be expanded
-> > > when writing commit/abort of "big" xid. This is how we could handle this
-> > > in "buffered" logging (delayed fsync) mode:
-> > >
-> > > When backend want to write commit/abort status he acquires exclusive
-> > > OnLineLogLock. If xid belongs to online pages then backend writes status
-
-This confuses me. Why does a backend need to lock pg_log to update a
-transaction status?
-
-> > > and releases spin. If xid is less than least xid on 1st online page then
-> > > backend releases spin and does exactly the same what he does in normal mode:
-> > > flush (write and fsync) all durty data files, lock pg_log for write, ReadBuffer,
-> > > update xid status, WriteBuffer, release write lock, flush pg_log.
-> > > If xid is greater than max xid on 2nd online page then the simplest way is
-> > > just do sync(); sync() (two times), flush 1st or both online pages,
-> > > read new page(s) into online pages space, update xid status,
-> > > release OnLineLogLock spin. We could try other ways but pg_log expanding
-> > > is rare case (32K xids in one pg_log page)...
-> > > All what postmaster will have to do is:
-> > > 1. Get shared OnLineLogLock.
-> > > 2. Copy 2 x 8K data to private place.
-> > > 3. Release spinlock.
-> > > 4. sync(); sync(); (two times!)
-> > > 5. Flush online pages.
-
-Great.
-
-> > >
-> > > We could use -F DELAY_TIME to turn fsync delayed mode ON.
-> > >
-> > > And, btw, having two bits for xact status we have only one unused
-> > > status value (0x11) currently - I would like to use this for
-> > > nested xactions and savepoints...
->     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-> More about this: 0x11 could mean "this _child_ transaction is committed -
-> you have to lookup in pg_xact_child to get parent xid and use pg_log again
-> to get parent xact status". If parent committed then child xact status
-> will be changed to 0x10 (committed) else - to 0x01 (aborted). Using this
-> we could get xact nesting and savepoints by starting new child xaction
-> inside running one...
-
-OK.
-
->
-> >
-> > I saw that. By keeping two copies of pg_log, one in memory to be used
-> ^^^^^^
-> Just two pg_log pages...
-
-Got it.
-
-
---
-Bruce Momjian
-
-
-Received: from hub.org (hub.org [209.47.148.200])
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id WAA04655
- for ; Sun, 9 Nov 1997 22:07:30 -0500 (EST)
-Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id VAA07023; Sun, 9 Nov 1997 21:55:54 -0500 (EST)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Sun, 09 Nov 1997 21:52:20 -0500 (EST)
-Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id VAA06174 for pgsql-hackers-outgoing; Sun, 9 Nov 1997 21:52:13 -0500 (EST)
-Received: from candle.pha.pa.us ([email protected] [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id VAA06092 for ; Sun, 9 Nov 1997 21:51:58 -0500 (EST)
-Received: (from maillist@localhost)
- by candle.pha.pa.us (8.8.5/8.8.5) id VAA04150;
- Sun, 9 Nov 1997 21:50:29 -0500 (EST)
-From: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel! (fwd)
-Date: Sun, 9 Nov 1997 21:50:29 -0500 (EST)
-X-Mailer: ELM [version 2.4 PL25]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Status: OR
-
-Forwarded message:
-> > > Perhaps mmap() would be a good idea. My system has msync() to flush
-> > > mmap()'ed pages to the underlying file. You would still run fsync()
-> > > after that. This may give us the best of both worlds: a shared-memory
-> >                                                           ^^^^^^^^^^^^^
-> > > area of variable size, and control of when it get flushed to disk. Do
-> >   ^^^^^^^^^^^^^^^^^^^^^
-> > I like it. FreeBSD supports
-> >
-> > MAP_ANON Map anonymous memory not associated with any specific file.
-> >
-> > It would be nice to use mmap to get more "shared" memory, but I don't see
-> > reasons to mmap any particular file to memory. Having two last pg_log pages
-> > in memory + xact commit/abort writeback optimization (updation of commit/abort
-> > xmin/xmax status in tuples by any scan - we already have this) reduce access
-> > to "old" pg_log pages to zero.
->
-> I totally agree. There is no advantage to mmap() vs. shared memory for
-> us. I thought if we could control when the mmap() gets flushed to disk,
-> we could let the OS handle the syncing, but I doubt this is going to be
-> portable.
->
-> Though, we could mmap() pg_log, and that way backends would not have to
-> read/write the blocks, and they could all see the same data. But with
-> the new scheme, they have most transaction ids in shared memory.
->
-> Interesting you mention the scan updating the transaction status. We
-> would have a problem here. It is possible a backend will update the
-> commit status of a data page, and that data page will make it to disk,
-> but if there is a crash before the update pg_log gets sync'ed, there
-> would be a partial transaction in the system.
->
-> I don't know any way that a backend would know the transaction has hit
-> disk, and the data commit flag could be set. You don't want to update
-> the commit flag of the data page until entire transaction has been
-> sync'ed. The only way to do that would be to have a 'commit and synced'
-> flag, but you want to save that for nested transactions.
->
-> Another case this could come in handy is to allow reuse of superceeded
-> data rows. If the transaction is committed and synced, the row space
-> could be reused by another transaction.
->
-
-I have been thinking about the mmap() issue, and it seems a natural for
-pg_log. You can have every backend mmap() pg_log. It becomes a dynamic
-shared memory area that is auto-initialized to the contents of pg_log,
-and all changes can be made by all backends. No locking needed. We can
-also flush the changes to the underlying file. Under bsdi, you can also
-have the mmap area follow you across exec() calls, so each backend
-doesn't have to do anything. I want to replace exec with fork also, so
-the stuff would be auto-loaded in the address space of each backend.
-
-This way, you don't have to have two on-line pages and move them around
-as pg_log grows.
-
-The only remaining problem is how to mark certain transactions as synced,
-or to force only synced transactions to hit the pg_log file itself; data
-row commit status should be updated only for synced transactions.
-
---
-Bruce Momjian
-
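-A minimal sketch of the mmap()/msync() idea described above, using only
-standard POSIX calls: map a file MAP_SHARED, change one byte through the
-mapping, then force the change out with msync() followed by fsync(). The
-function name and its parameters are hypothetical. Nothing here controls
-when the kernel writes the dirty page on its own initiative, which is the
-open question about synced and unsynced transactions.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* expose fsync() under strict -std modes */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * OR `bits` into the byte at `offset` of the file at `path` through a shared
 * mapping of its first `len` bytes, then flush the change to disk.
 * Returns 0 on success, -1 on failure.
 */
static int update_mapped_byte(const char *path, size_t len, size_t offset,
                              unsigned char bits)
{
    assert(offset < len);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    unsigned char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    map[offset] |= bits;                 /* visible to every process mapping the file */

    int rc = msync(map, len, MS_SYNC);   /* write dirty mapped pages back to the file */
    if (rc == 0)
        rc = fsync(fd);                  /* then flush the file itself, as suggested */

    munmap(map, len);
    close(fd);
    return rc;
}
```

-The unlocked `|=` on the mapped byte is the same read-modify-write hazard as
-with any other shared memory, so mapping the file does not by itself remove
-the need for locking.
-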
-
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id WAA25139 for ; Sun, 9 Nov 1997 22:42:33 -0500 (EST)
-Received: from www.krasnet.ru (www.krasnet.ru [193.125.44.86]) by www.krasnet.ru (8.8.7/8.7.3) with SMTP id KAA01845; Mon, 10 Nov 1997 10:49:25 +0700 (KRS)
-Date: Mon, 10 Nov 1997 10:49:24 +0700
-From: "Vadim B. Mikheev"
-Organization: ITTS (Krasnoyarsk)
-X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
-MIME-Version: 1.0
-To: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Status: OR
-
-Bruce Momjian wrote:
->
-> > > Anything that promises that much speedup, if it can be done easy, I say
-> > > lets consider it, even if you loose 60 seconds of changes.
-> >
-> > I agreed with your un-buffered logging idea. This would be excellent
-> > feature for un-critical dbase usings (WWW, etc).
->
-> Actually, it is buffered logging. We currently have unbuffered logging,
-> I think.
-
-Sorry - mistyping.
-
->
-> Interesting you mention the scan updating the transaction status. We
-> would have a problem here. It is possible a backend will update the
-> commit status of a data page, and that data page will make it to disk,
-> but if there is a crash before the update pg_log gets sync'ed, there
-> would be a partial transaction in the system.
-
-You're right! Currently, only system relations can be affected by this:
-backend releases locks on user tables after syncing data and pg_log.
-I'll keep this in mind...
-
-> > > > These two last pg_log pages are "online" ones. Race condition: when one or
-> > > > both of online pages becomes non-online ones, i.e. pg_log has to be expanded
-> > > > when writing commit/abort of "big" xid. This is how we could handle this
-> > > > in "buffered" logging (delayed fsync) mode:
-> > > >
-> > > > When backend want to write commit/abort status he acquires exclusive
-> > > > OnLineLogLock. If xid belongs to online pages then backend writes status
->
-> This confuses me. Why does a backend need to lock pg_log to update a
-> transaction status?
-
-What if two backends try to change xact statuses in the same byte ?
-
-Vadim
-
-Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.14 $) with ESMTP id XAA27105 for ; Sun, 9 Nov 1997 23:41:39 -0500 (EST)
-Received: from localhost (majordom@localhost) by hub.org (8.8.5/8.7.5) with SMTP id XAA08860; Sun, 9 Nov 1997 23:35:42 -0500 (EST)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Sun, 09 Nov 1997 23:31:50 -0500 (EST)
-Received: (from majordom@localhost) by hub.org (8.8.5/8.7.5) id XAA07962 for pgsql-hackers-outgoing; Sun, 9 Nov 1997 23:31:43 -0500 (EST)
-Received: from candle.pha.pa.us ([email protected] [206.84.210.195]) by hub.org (8.8.5/8.7.5) with ESMTP id XAA07875 for ; Sun, 9 Nov 1997 23:31:28 -0500 (EST)
-Received: (from maillist@localhost)
- by candle.pha.pa.us (8.8.5/8.8.5) id XAA05566;
- Sun, 9 Nov 1997 23:17:41 -0500 (EST)
-From: Bruce Momjian
-Subject: Re: [HACKERS] PERFORMANCE and Good Bye, Time Travel!
-Date: Sun, 9 Nov 1997 23:17:41 -0500 (EST)
-X-Mailer: ELM [version 2.4 PL25]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Status: OR
-
-> > > > > These two last pg_log pages are "online" ones. Race condition: when one or
-> > > > > both of online pages becomes non-online ones, i.e. pg_log has to be expanded
-> > > > > when writing commit/abort of "big" xid. This is how we could handle this
-> > > > > in "buffered" logging (delayed fsync) mode:
-> > > > >
-> > > > > When backend want to write commit/abort status he acquires exclusive
-> > > > > OnLineLogLock. If xid belongs to online pages then backend writes status
-> >
-> > This confuses me. Why does a backend need to lock pg_log to update a
-> > transaction status?
->
-> What if two backends try to change xact statuses in the same byte ?
-
-Ooo, you got me. I so hoped to prevent locking. It would be nice if:
-
- *x |= 3;
-
-would be atomic, but I don't think it is. Most RISC machines don't even
-have an OR against a memory address, I think.
-
---
-Bruce Momjian
-
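-A minimal sketch of an atomic version of the `*x |= 3` update, written with
-C11 <stdatomic.h>. That header postdates this thread, so it is shown only to
-illustrate the operation being wished for, not what was available at the
-time. The helper name and its slot numbering are hypothetical.
-atomic_fetch_or() is guaranteed to be atomic; on machines without an OR
-against memory the compiler typically builds it from a load/store retry
-loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Atomically OR a two-bit status into one of the four slots of a shared
 * status byte. Returns the byte's previous value. Because OR can only set
 * bits, this is sufficient only when the slot starts out as 00.
 */
static uint8_t set_status_bits(_Atomic uint8_t *byte, unsigned slot, uint8_t status)
{
    assert(slot < 4 && status <= 3);
    return atomic_fetch_or(byte, (uint8_t)(status << (slot * 2)));
}
```

-Two backends setting different slots of the same byte cannot lose each
-other's update this way. Moving a slot from one non-zero status to another
-(for example 11 to 10) needs bits cleared as well, which would take a
-compare-and-exchange loop or a lock.
-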
-