TIP 2: you can get off all lists at once with the unregister command
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P6RSF12626
+ for
; Tue, 25 Jun 2002 02:27:28 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP
+ id 2C72F475EF6; Tue, 25 Jun 2002 02:27:28 -0400 (EDT)
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by postgresql.org (Postfix) with SMTP
+ id 42AAB475B26; Tue, 25 Jun 2002 02:07:04 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP id A8D13475A06
+ for
; Tue, 25 Jun 2002 02:07:01 -0400 (EDT)
+Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
+ by postgresql.org (Postfix) with ESMTP id F3C264760A1
+ for
; Tue, 25 Jun 2002 01:05:49 -0400 (EDT)
+Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
+ by academic.cynic.net (Postfix) with ESMTP
+ id 5F61CF820; Tue, 25 Jun 2002 05:05:47 +0000 (UTC)
+Date: Tue, 25 Jun 2002 14:05:45 +0900 (JST)
+From: Curt Sampson
+To: "J. R. Nield"
+cc: Bruce Momjian
, Tom Lane ,
+Subject: [HACKERS] Buffer Management
+MIME-Version: 1.0
+Content-Type: TEXT/PLAIN; charset=US-ASCII
+Precedence: bulk
+X-Spam-Status: No, hits=-5.3 required=5.0
+ tests=IN_REP_TO,X_NOT_PRESENT
+ version=2.30
+Status: OR
+
+I'm splitting off this buffer management stuff into a separate thread.
+
+On 24 Jun 2002, J. R. Nield wrote:
+
+> I'll back off on that. I don't know if we want to use the OS buffer
+> manager, but shouldn't we try to have our buffer manager group writes
+> together by files, and pro-actively get them out to disk?
+
+The only way the postgres buffer manager can "get [data] out to disk"
+is to do an fsync(). For data files (as opposed to log files), this can
+only slow down overall system throughput, as this would only disrupt the
+OS's write management.
+
+> Right now, it
+> looks like all our write requests are delayed as long as possible and
+> the order in which they are written is pretty-much random, as is the
+> backend that writes the block, so there is no locality of reference even
+> when the blocks are adjacent on disk, and the write calls are spread-out
+> over all the backends.
+
+It doesn't matter. The OS will introduce locality of reference with its
+write algorithms. Take a look at
+
+ http://www.cs.wisc.edu/~solomon/cs537/disksched.html
+
+for an example. Most OSes use the elevator or one-way elevator
+algorithm. So it doesn't matter whether it's one back-end or many
+writing, and it doesn't matter in what order they do the write.
+
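A minimal sketch of the one-way elevator ordering mentioned above, assuming the pending requests are plain block numbers and the head position is known. The function name `elevator_order` and its signature are hypothetical and are not taken from any kernel or from PostgreSQL; real disk schedulers are considerably more involved.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for block numbers. */
static int cmp_block(const void *a, const void *b)
{
    long x = *(const long *) a;
    long y = *(const long *) b;

    return (x > y) - (x < y);
}

/*
 * Reorder n pending block requests in place into one-way elevator order
 * for a head currently at block `head`: first every request at or beyond
 * the head in ascending order, then wrap around and serve the remaining
 * requests, again in ascending order.
 * Returns 0 on success, -1 if scratch memory could not be allocated.
 */
int elevator_order(long *req, size_t n, long head)
{
    size_t split = 0;
    size_t i;
    long  *tmp;

    if (n == 0)
        return 0;

    qsort(req, n, sizeof *req, cmp_block);

    /* Find the first request the head has not yet passed. */
    while (split < n && req[split] < head)
        split++;

    /* Rotate so that those requests come first. */
    tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    for (i = 0; i < n; i++)
        tmp[i] = req[(split + i) % n];
    memcpy(req, tmp, n * sizeof *req);
    free(tmp);
    return 0;
}
```

With requests 90, 10, 55, 30, 70 and the head at block 50, the resulting order is 55, 70, 90, 10, 30, whichever process submitted each request and in whatever order they arrived.
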
+> Would it not be the case that things like read-ahead, grouping writes,
+> and caching written data are probably best done by PostgreSQL, because
+> only our buffer manager can understand when they will be useful or when
+> they will thrash the cache?
+
+Operating systems these days are not too bad at guessing what
+you're doing. Pretty much every OS I've seen will do read-ahead when
+it detects you're doing sequential reads, at least in the forward
+direction. And Solaris is even smart enough to mark the pages you've
+read as "not needed" so that they quickly get flushed from the cache,
+rather than blowing out your entire cache if you go through a large
+file.
+
+> Would O_DSYNC|O_RSYNC turn off the cache?
+
+No. I suppose there's nothing to stop it doing so, in some
+implementations, but the interface is not designed for direct I/O.
+
+> Since you know a lot about NetBSD internals, I'd be interested in
+> hearing about what postgresql looks like to the NetBSD buffer manager.
+
+Well, looks like pretty much any program, or group of programs,
+doing a lot of I/O. :-)
+
+> Am I right that strings of successive writes get randomized?
+
+No; as I pointed out, they in fact get de-randomized as much as
+possible. The more processes you have throwing out requests, the better
+the throughput will be in fact.
+
+> What do our cache-hit percentages look like? I'm going to do some
+> experimenting with this.
+
+Well, that depends on how much memory you have and what your working
+set is. :-)
+
+cjs
+--
+Curt Sampson +81 90 7737 2974 http://www.netbsd.org
+ Don't you know, in this new Dark Age, we're all light. --XTC
+
+
+
+
+---------------------------(end of broadcast)---------------------------
+TIP 6: Have you searched our list archives?
+
+http://archives.postgresql.org
+
+
+
+Return-path:
+Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
+ by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PDqKF07478
+ for
; Tue, 25 Jun 2002 09:52:22 -0400 (EDT)
+Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
+ by academic.cynic.net (Postfix) with ESMTP
+ id D9242F820; Tue, 25 Jun 2002 13:52:18 +0000 (UTC)
+Date: Tue, 25 Jun 2002 22:52:14 +0900 (JST)
+From: Curt Sampson
+To: "J. R. Nield"
+cc: Bruce Momjian
, Tom Lane ,
+Subject: Re: [HACKERS] Buffer Management
+MIME-Version: 1.0
+Content-Type: TEXT/PLAIN; charset=US-ASCII
+Status: OR
+
+
+So, while we're at it, what's the current state of people's thinking
+on using mmap rather than shared memory for data file buffers? I
+see some pretty powerful advantages to this approach, and I'm not
+(yet :-)) convinced that the disadvantages are as bad as people think.
+I think I can address most of the concerns in doc/TODO.detail/mmap.
+
+Is this worth pursuing a bit? (I.e., should I spend an hour or two
+writing up the advantages and thoughts on how to get around the
+problems?) Anybody got objections that aren't in doc/TODO.detail/mmap?
+
+cjs
+--
+Curt Sampson +81 90 7737 2974 http://www.netbsd.org
+ Don't you know, in this new Dark Age, we're all light. --XTC
+
+
+Return-path:
+Received: from sss.pgh.pa.us (root@[192.204.191.242])
+ by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PE96F08922
+ for
; Tue, 25 Jun 2002 10:09:06 -0400 (EDT)
+Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
+ by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5PE92107301;
+ Tue, 25 Jun 2002 10:09:02 -0400 (EDT)
+To: Curt Sampson
+cc: "J. R. Nield"
, Bruce Momjian ,
+Subject: Re: [HACKERS] Buffer Management
+Comments: In-reply-to Curt Sampson
+ message dated "Tue, 25 Jun 2002 22:52:14 +0900"
+Date: Tue, 25 Jun 2002 10:09:02 -0400
+From: Tom Lane
+Status: ORr
+
+Curt Sampson writes:
+> So, while we're at it, what's the current state of people's thinking
+> on using mmap rather than shared memory for data file buffers?
+
+There seem to be a couple of different threads in doc/TODO.detail/mmap.
+
+One envisions mmap as a one-for-one replacement for our current use of
+SysV shared memory, the main selling point being to get out from under
+kernels that don't have SysV support or have it configured too small.
+This might be worth doing, and I think it'd be relatively easy to do
+now that the shared memory support is isolated in one file and there's
+provisions for selecting a shmem implementation at configure time.
+The only thing you'd really have to think about is how to replace the
+current behavior that uses shmem attach counts to discover whether any
+old backends are left over from a previous crashed postmaster. I dunno
+if mmap offers any comparable facility.
+
+The other discussion seemed to be considering how to mmap individual
+data files right into backends' address space. I do not believe this
+can possibly work, because of loss of control over visibility of data
+changes to other backends, timing of write-backs, etc.
+
+But as long as you stay away from interpretation #2 and go with
+mmap-as-a-shmget-substitute, it might be worthwhile.
+
+(Hey Marc, can one do mmap in a BSD jail?)
+
+ regards, tom lane
+
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PEKgF10228
+ for
; Tue, 25 Jun 2002 10:20:42 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP
+ id 7259547609E; Tue, 25 Jun 2002 10:20:35 -0400 (EDT)
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by postgresql.org (Postfix) with SMTP
+ id 8E79647604C; Tue, 25 Jun 2002 10:20:33 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP id C3EB1476002
+ for
; Tue, 25 Jun 2002 10:20:30 -0400 (EDT)
+Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
+ by postgresql.org (Postfix) with ESMTP id 887F9475B2F
+ for
; Tue, 25 Jun 2002 10:20:16 -0400 (EDT)
+Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
+ by academic.cynic.net (Postfix) with ESMTP
+ id 16CCDF820; Tue, 25 Jun 2002 14:20:19 +0000 (UTC)
+Date: Tue, 25 Jun 2002 23:20:15 +0900 (JST)
+From: Curt Sampson
+To: Tom Lane
+cc: "J. R. Nield"
, Bruce Momjian ,
+Subject: Re: [HACKERS] Buffer Management
+MIME-Version: 1.0
+Content-Type: TEXT/PLAIN; charset=US-ASCII
+Precedence: bulk
+X-Spam-Status: No, hits=-5.3 required=5.0
+ tests=IN_REP_TO,X_NOT_PRESENT
+ version=2.30
+Status: OR
+
+On Tue, 25 Jun 2002, Tom Lane wrote:
+
+> The only thing you'd really have to think about is how to replace the
+> current behavior that uses shmem attach counts to discover whether any
+> old backends are left over from a previous crashed postmaster. I dunno
+> if mmap offers any comparable facility.
+
+Sure. Just mmap a file, and it will be persistent.
+
+> The other discussion seemed to be considering how to mmap individual
+> data files right into backends' address space. I do not believe this
+> can possibly work, because of loss of control over visibility of data
+> changes to other backends, timing of write-backs, etc.
+
+I don't understand why there would be any loss of visibility of changes.
+If two backends mmap the same block of a file, and it's shared, that's
+the same block of physical memory that they're accessing. Changes don't
+even need to "propagate," because the memory is truly shared. You'd keep
+your locks in the page itself as well, of course.
+
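A minimal sketch of the behaviour described above, assuming a POSIX system: two processes each map the same block of one file with MAP_SHARED, and a store made through one mapping is read back through the other. The scratch file path and the function name `shared_mapping_demo` are hypothetical, and the sketch says nothing about locking or about when the page reaches disk.

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns the value the parent reads through its own mapping after the
 * child has stored 42 through a separate mapping of the same file block,
 * or -1 on any error.
 */
int shared_mapping_demo(void)
{
    char   path[] = "/tmp/mmap_demo_XXXXXX";    /* hypothetical scratch file */
    int    fd = mkstemp(path);
    size_t len;
    int   *mine;
    pid_t  pid;
    int    status;
    int    seen = -1;

    if (fd < 0)
        return -1;
    unlink(path);                       /* file lives on until fd is closed */

    len = (size_t) sysconf(_SC_PAGESIZE);
    if (ftruncate(fd, (off_t) len) < 0) {
        close(fd);
        return -1;
    }

    mine = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mine == MAP_FAILED) {
        close(fd);
        return -1;
    }

    pid = fork();
    if (pid == 0) {
        /* Child: its own, independent mapping of the same block. */
        int *theirs = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);

        if (theirs == MAP_FAILED)
            _exit(1);
        *theirs = 42;
        _exit(0);
    }

    /* Parent: once the child has exited cleanly, read our own mapping. */
    if (pid > 0 && waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0)
        seen = *mine;

    munmap(mine, len);
    close(fd);
    return seen;
}
```

No msync() or read() is involved between the child's store and the parent's load; the store is visible through the shared mapping itself.
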
+Can you describe the problem in more detail?
+
+> But as long as you stay away from interpretation #2 and go with
+> mmap-as-a-shmget-substitute, it might be worthwhile.
+
+It's #2 that I was really looking at. :-)
+
+cjs
+--
+Curt Sampson +81 90 7737 2974 http://www.netbsd.org
+ Don't you know, in this new Dark Age, we're all light. --XTC
+
+
+
+
+
+
+
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PEPKF10831
+ for
; Tue, 25 Jun 2002 10:25:20 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP
+ id AA2EF475C46; Tue, 25 Jun 2002 10:25:13 -0400 (EDT)
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by postgresql.org (Postfix) with SMTP
+ id 9657447603B; Tue, 25 Jun 2002 10:23:23 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP id 364D0475FC2
+ for
; Tue, 25 Jun 2002 10:23:18 -0400 (EDT)
+Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
+ by postgresql.org (Postfix) with ESMTP id C063F47594B
+ for
; Tue, 25 Jun 2002 10:20:35 -0400 (EDT)
+Received: (from pgman@localhost)
+ by candle.pha.pa.us (8.11.6/8.10.1) id g5PEKT310222;
+ Tue, 25 Jun 2002 10:20:29 -0400 (EDT)
+Subject: Re: [HACKERS] Buffer Management
+To: Tom Lane
+Date: Tue, 25 Jun 2002 10:20:29 -0400 (EDT)
+cc: Curt Sampson , "J. R. Nield" ,
+X-Mailer: ELM [version 2.4ME+ PL97 (25)]
+MIME-Version: 1.0
+Content-Transfer-Encoding: 7bit
+Content-Type: text/plain; charset=US-ASCII
+Precedence: bulk
+X-Spam-Status: No, hits=-3.4 required=5.0
+ tests=IN_REP_TO
+ version=2.30
+Status: OR
+
+Tom Lane wrote:
+> Curt Sampson writes:
+> > So, while we're at it, what's the current state of people's thinking
+> > on using mmap rather than shared memory for data file buffers?
+>
+> There seem to be a couple of different threads in doc/TODO.detail/mmap.
+>
+> One envisions mmap as a one-for-one replacement for our current use of
+> SysV shared memory, the main selling point being to get out from under
+> kernels that don't have SysV support or have it configured too small.
+> This might be worth doing, and I think it'd be relatively easy to do
+> now that the shared memory support is isolated in one file and there's
+> provisions for selecting a shmem implementation at configure time.
+> The only thing you'd really have to think about is how to replace the
+> current behavior that uses shmem attach counts to discover whether any
+> old backends are left over from a previous crashed postmaster. I dunno
+> if mmap offers any comparable facility.
+>
+> The other discussion seemed to be considering how to mmap individual
+> data files right into backends' address space. I do not believe this
+> can possibly work, because of loss of control over visibility of data
+> changes to other backends, timing of write-backs, etc.
+
+Agreed. Also, there was an interesting thread saying mmap'ing /dev/zero is
+the same as anonmap for OS's that don't have anonmap. That should cover
+most of them. The only downside I can see is that SysV shared memory is
+locked into RAM on some/most OS's while mmap anon probably isn't.
+Locking in RAM is good in most cases, bad in others.
+
+This will also work well when we have non-SysV semaphore support, like
+Posix semaphores, so we would be able to run with no SysV stuff.
+
+--
+ Bruce Momjian | http://candle.pha.pa.us
+ + If your life is a hard drive, | 830 Blythe Avenue
+ + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
+
+
+
+
+
+
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PEReF11147
+ for
; Tue, 25 Jun 2002 10:27:40 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP
+ id B33CD476047; Tue, 25 Jun 2002 10:27:16 -0400 (EDT)
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by postgresql.org (Postfix) with SMTP
+ id 3091247606D; Tue, 25 Jun 2002 10:23:24 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP id 6C39D476002
+ for
; Tue, 25 Jun 2002 10:23:19 -0400 (EDT)
+Received: from internet.csl.co.uk (internet.csl.co.uk [194.130.52.3])
+ by postgresql.org (Postfix) with ESMTP id AC203475C46
+ for
; Tue, 25 Jun 2002 10:20:49 -0400 (EDT)
+Received: from euphrates.csl.co.uk (host-194-67.csl.co.uk [194.130.52.67])
+ by internet.csl.co.uk (8.12.1/8.12.1) with ESMTP id g5PEKonH023514;
+ Tue, 25 Jun 2002 15:20:50 +0100
+Received: from kelvin.csl.co.uk by euphrates.csl.co.uk (8.9.3/ConceptI 2.4)
+ id PAA08847; Tue, 25 Jun 2002 15:20:52 +0100 (BST)
+Received: by kelvin.csl.co.uk (8.11.6) id g5PEKoT28846; Tue, 25 Jun 2002 15:20:50 +0100
+From: Lee Kindness
+MIME-Version: 1.0
+Content-Type: text/plain; charset=us-ascii
+Content-Transfer-Encoding: 7bit
+Date: Tue, 25 Jun 2002 15:20:49 +0100
+To: Tom Lane
+Subject: Re: [HACKERS] Buffer Management
+X-Mailer: VM 7.00 under 21.4 (patch 6) "Common Lisp" XEmacs Lucid
+Precedence: bulk
+X-Spam-Status: No, hits=-3.4 required=5.0
+ tests=IN_REP_TO
+ version=2.30
+Status: OR
+
+Tom Lane writes:
+ > There seem to be a couple of different threads in
+ > doc/TODO.detail/mmap.
+ > [ snip ]
+
+A place where mmap could be easily used and would offer a good
+performance increase is for COPY FROM.
+
+Lee.
+
+
+
+
+
+
+Return-path:
+Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
+ by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PEOmF10749
+ for
; Tue, 25 Jun 2002 10:24:49 -0400 (EDT)
+Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
+ by academic.cynic.net (Postfix) with ESMTP
+ id F2629F820; Tue, 25 Jun 2002 14:24:47 +0000 (UTC)
+Date: Tue, 25 Jun 2002 23:24:44 +0900 (JST)
+From: Curt Sampson
+cc: Tom Lane , "J. R. Nield" ,
+Subject: Re: [HACKERS] Buffer Management
+MIME-Version: 1.0
+Content-Type: TEXT/PLAIN; charset=US-ASCII
+Status: OR
+
+On Tue, 25 Jun 2002, Bruce Momjian wrote:
+
+> The only downside I can see is that SysV shared memory is
+> locked into RAM on some/most OS's while mmap anon probably isn't.
+
+It is if you mlock() it. :-)
+
+cjs
+--
+Curt Sampson +81 90 7737 2974 http://www.netbsd.org
+ Don't you know, in this new Dark Age, we're all light. --XTC
+
+
+Return-path:
+Received: from sss.pgh.pa.us (root@[192.204.191.242])
+ by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PETpF11341
+ for
; Tue, 25 Jun 2002 10:29:52 -0400 (EDT)
+Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
+ by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5PETn107501;
+ Tue, 25 Jun 2002 10:29:49 -0400 (EDT)
+To: Curt Sampson
+cc: "J. R. Nield"
, Bruce Momjian ,
+Subject: Re: [HACKERS] Buffer Management
+Comments: In-reply-to Curt Sampson
+ message dated "Tue, 25 Jun 2002 23:20:15 +0900"
+Date: Tue, 25 Jun 2002 10:29:49 -0400
+From: Tom Lane
+Status: ORr
+
+Curt Sampson writes:
+> On Tue, 25 Jun 2002, Tom Lane wrote:
+>> The other discussion seemed to be considering how to mmap individual
+>> data files right into backends' address space. I do not believe this
+>> can possibly work, because of loss of control over visibility of data
+>> changes to other backends, timing of write-backs, etc.
+
+> I don't understand why there would be any loss of visibility of changes.
+> If two backends mmap the same block of a file, and it's shared, that's
+> the same block of physical memory that they're accessing.
+
+Is it? You have a mighty narrow conception of the range of
+implementations that's possible for mmap.
+
+But the main problem is that mmap doesn't let us control when changes to
+the memory buffer will get reflected back to disk --- AFAICT, the OS is
+free to do the write-back at any instant after you dirty the page, and
+that completely breaks the WAL algorithm. (WAL = write AHEAD log;
+the log entry describing a change must hit disk before the data page
+change itself does.)
+
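A minimal sketch of the ordering constraint described above, using plain write(), fsync() and pwrite() rather than mmap. The names `logged_page_write` and `wal_demo`, the record text and the scratch file paths are all hypothetical; this is not PostgreSQL's actual WAL code, only an illustration of "log record on disk before the data page is handed to the kernel".

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write all n bytes or return -1. */
static int write_all(int fd, const char *buf, size_t n)
{
    while (n > 0) {
        ssize_t w = write(fd, buf, n);

        if (w < 0)
            return -1;
        buf += w;
        n -= (size_t) w;
    }
    return 0;
}

/*
 * Append a log record, force it to disk, and only then issue the write
 * for the data page.  Returns 0 on success, -1 on error.
 */
int logged_page_write(int wal_fd, const char *rec, size_t rec_len,
                      int data_fd, off_t page_off,
                      const char *page, size_t page_len)
{
    if (write_all(wal_fd, rec, rec_len) < 0)
        return -1;
    if (fsync(wal_fd) < 0)              /* the log record is durable now */
        return -1;

    /* From here on the kernel may flush the data page whenever it likes. */
    if (pwrite(data_fd, page, page_len, page_off) != (ssize_t) page_len)
        return -1;
    return 0;
}

/* Exercise the function on two scratch files; returns 0 on success. */
int wal_demo(void)
{
    static const char rec[] = "page 0 filled with x\n";
    char wal_path[] = "/tmp/wal_demo_XXXXXX";
    char data_path[] = "/tmp/data_demo_XXXXXX";
    char page[8192];
    char back = 0;
    int  wal_fd, data_fd, rc;

    wal_fd = mkstemp(wal_path);
    if (wal_fd < 0)
        return -1;
    unlink(wal_path);
    data_fd = mkstemp(data_path);
    if (data_fd < 0) {
        close(wal_fd);
        return -1;
    }
    unlink(data_path);

    memset(page, 'x', sizeof page);
    rc = logged_page_write(wal_fd, rec, sizeof rec - 1,
                           data_fd, 0, page, sizeof page);
    if (rc == 0 && (pread(data_fd, &back, 1, 0) != 1 || back != 'x'))
        rc = -1;

    close(wal_fd);
    close(data_fd);
    return rc;
}
```

With an mmap'd data file there is no equivalent of the pwrite() step to hold back: the store into the mapped page is itself the point after which the kernel is free to write the page out.
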
+ regards, tom lane
+
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PEicF14506
+ for
; Tue, 25 Jun 2002 10:44:38 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP
+ id E20F8476322; Tue, 25 Jun 2002 10:44:27 -0400 (EDT)
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by postgresql.org (Postfix) with SMTP
+ id 47B4847609E; Tue, 25 Jun 2002 10:34:29 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP id 52A5F475E5F
+ for
; Tue, 25 Jun 2002 10:34:25 -0400 (EDT)
+Received: from sss.pgh.pa.us (unknown [192.204.191.242])
+ by postgresql.org (Postfix) with ESMTP id 458BB476239
+ for
; Tue, 25 Jun 2002 10:32:12 -0400 (EDT)
+Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
+ by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5PEWA107527;
+ Tue, 25 Jun 2002 10:32:10 -0400 (EDT)
+cc: Curt Sampson , "J. R. Nield" ,
+Subject: Re: [HACKERS] Buffer Management
+Comments: In-reply-to Bruce Momjian
+ message dated "Tue, 25 Jun 2002 10:20:29 -0400"
+Date: Tue, 25 Jun 2002 10:32:10 -0400
+From: Tom Lane
+Precedence: bulk
+X-Spam-Status: No, hits=-5.3 required=5.0
+ tests=IN_REP_TO,X_NOT_PRESENT
+ version=2.30
+Status: ORr
+
+> This will also work well when we have non-SysV semaphore support, like
+> Posix semaphores, so we would be able to run with no SysV stuff.
+
+You do realize that we can use Posix semaphores today? The Darwin (OS X)
+port uses 'em now. That's one reason I am more interested in mmap as
+a shmget substitute than I used to be.
+
+ regards, tom lane
+
+
+
+
+
+
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PF2JF16153
+ for
; Tue, 25 Jun 2002 11:02:20 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP
+ id 7FB0F47630C; Tue, 25 Jun 2002 11:02:11 -0400 (EDT)
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by postgresql.org (Postfix) with SMTP
+ id B755E475C22; Tue, 25 Jun 2002 10:59:45 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP id 7D058476387
+ for
; Tue, 25 Jun 2002 10:59:38 -0400 (EDT)
+Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
+ by postgresql.org (Postfix) with ESMTP id 49F8C475DC6
+ for
; Tue, 25 Jun 2002 10:56:00 -0400 (EDT)
+Received: (from pgman@localhost)
+ by candle.pha.pa.us (8.11.6/8.10.1) id g5PEtst15464;
+ Tue, 25 Jun 2002 10:55:54 -0400 (EDT)
+Subject: Re: [HACKERS] Buffer Management
+To: Tom Lane
+Date: Tue, 25 Jun 2002 10:55:54 -0400 (EDT)
+cc: Curt Sampson , "J. R. Nield" ,
+X-Mailer: ELM [version 2.4ME+ PL97 (25)]
+MIME-Version: 1.0
+Content-Transfer-Encoding: 7bit
+Content-Type: text/plain; charset=US-ASCII
+Precedence: bulk
+X-Spam-Status: No, hits=-3.4 required=5.0
+ tests=IN_REP_TO
+ version=2.30
+Status: OR
+
+Tom Lane wrote:
+> > This will also work well when we have non-SysV semaphore support, like
+> > Posix semaphores, so we would be able to run with no SysV stuff.
+>
+> You do realize that we can use Posix semaphores today? The Darwin (OS X)
+> port uses 'em now. That's one reason I am more interested in mmap as
+
+No, I didn't realize we had gotten that far.
+
+--
+ Bruce Momjian | http://candle.pha.pa.us
+ + If your life is a hard drive, | 830 Blythe Avenue
+ + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
+
+
+
+
+
+
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PF5CF16398
+ for
; Tue, 25 Jun 2002 11:05:13 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP
+ id 30D2847634D; Tue, 25 Jun 2002 11:05:04 -0400 (EDT)
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by postgresql.org (Postfix) with SMTP
+ id B49B5475EFA; Tue, 25 Jun 2002 10:59:47 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP id A0F20475978
+ for
; Tue, 25 Jun 2002 10:59:43 -0400 (EDT)
+Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
+ by postgresql.org (Postfix) with ESMTP id 8160E4762F0
+ for
; Tue, 25 Jun 2002 10:57:03 -0400 (EDT)
+Received: (from pgman@localhost)
+ by candle.pha.pa.us (8.11.6/8.10.1) id g5PEuwO15564;
+ Tue, 25 Jun 2002 10:56:58 -0400 (EDT)
+Subject: Re: [HACKERS] Buffer Management
+To: Tom Lane
+Date: Tue, 25 Jun 2002 10:56:58 -0400 (EDT)
+cc: Curt Sampson , "J. R. Nield" ,
+X-Mailer: ELM [version 2.4ME+ PL97 (25)]
+MIME-Version: 1.0
+Content-Transfer-Encoding: 7bit
+Content-Type: text/plain; charset=US-ASCII
+Precedence: bulk
+X-Spam-Status: No, hits=-2.3 required=5.0
+ tests=IN_REP_TO,DOUBLE_CAPSWORD
+ version=2.30
+Status: OR
+
+Tom Lane wrote:
+> Curt Sampson writes:
+> > On Tue, 25 Jun 2002, Tom Lane wrote:
+> >> The other discussion seemed to be considering how to mmap individual
+> >> data files right into backends' address space. I do not believe this
+> >> can possibly work, because of loss of control over visibility of data
+> >> changes to other backends, timing of write-backs, etc.
+>
+> > I don't understand why there would be any loss of visibility of changes.
+> > If two backends mmap the same block of a file, and it's shared, that's
+> > the same block of physical memory that they're accessing.
+>
+> Is it? You have a mighty narrow conception of the range of
+> implementations that's possible for mmap.
+>
+> But the main problem is that mmap doesn't let us control when changes to
+> the memory buffer will get reflected back to disk --- AFAICT, the OS is
+> free to do the write-back at any instant after you dirty the page, and
+> that completely breaks the WAL algorithm. (WAL = write AHEAD log;
+> the log entry describing a change must hit disk before the data page
+> change itself does.)
+
+Can we mmap WAL without problems? Not sure if there is any gain to it
+because we just write it and rarely read from it.
+
+--
+ Bruce Momjian | http://candle.pha.pa.us
+ + If your life is a hard drive, | 830 Blythe Avenue
+ + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
+
+
+
+
+
+
+Return-path:
+Received: from sss.pgh.pa.us (root@[192.204.191.242])
+ by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PF0JF15955
+ for
; Tue, 25 Jun 2002 11:00:19 -0400 (EDT)
+Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
+ by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5PF0J107808;
+ Tue, 25 Jun 2002 11:00:19 -0400 (EDT)
+cc: Curt Sampson , "J. R. Nield" ,
+Subject: Re: [HACKERS] Buffer Management
+Comments: In-reply-to Bruce Momjian
+ message dated "Tue, 25 Jun 2002 10:56:58 -0400"
+Date: Tue, 25 Jun 2002 11:00:19 -0400
+From: Tom Lane
+Status: ORr
+
+> Can we mmap WAL without problems? Not sure if there is any gain to it
+> because we just write it and rarely read from it.
+
+Perhaps, but I don't see any point to it.
+
+ regards, tom lane
+
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PFENF17356
+ for
; Tue, 25 Jun 2002 11:14:23 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP
+ id 8EAA3476244; Tue, 25 Jun 2002 11:14:09 -0400 (EDT)
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by postgresql.org (Postfix) with SMTP
+ id C32024762B0; Tue, 25 Jun 2002 11:10:33 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP id 1F81C4762A2
+ for
; Tue, 25 Jun 2002 11:10:31 -0400 (EDT)
+Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
+ by postgresql.org (Postfix) with ESMTP id CE09D475B33
+ for
; Tue, 25 Jun 2002 11:02:10 -0400 (EDT)
+Received: (from pgman@localhost)
+ by candle.pha.pa.us (8.11.6/8.10.1) id g5PF25r16113;
+ Tue, 25 Jun 2002 11:02:05 -0400 (EDT)
+Subject: Re: [HACKERS] Buffer Management
+To: Tom Lane
+Date: Tue, 25 Jun 2002 11:02:05 -0400 (EDT)
+cc: Curt Sampson , "J. R. Nield" ,
+X-Mailer: ELM [version 2.4ME+ PL97 (25)]
+MIME-Version: 1.0
+Content-Transfer-Encoding: 7bit
+Content-Type: text/plain; charset=US-ASCII
+Precedence: bulk
+X-Spam-Status: No, hits=-3.4 required=5.0
+ tests=IN_REP_TO
+ version=2.30
+Status: OR
+
+Tom Lane wrote:
+> > Can we mmap WAL without problems? Not sure if there is any gain to it
+> > because we just write it and rarely read from it.
+>
+> Perhaps, but I don't see any point to it.
+
+Agreed. I have been poking around google looking for an article I read
+months ago saying that mmap of files is slightly faster in low memory
+usage situations, but much slower in high memory usage situations
+because the kernel doesn't know as much about the file access in mmap as
+it does with stdio. I will find it. :-)
+
+--
+ Bruce Momjian | http://candle.pha.pa.us
+ + If your life is a hard drive, | 830 Blythe Avenue
+ + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
+
+
+
+
+
+
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PGDdF22106
+ for
; Tue, 25 Jun 2002 12:13:39 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP
+ id 962BD4762AF; Tue, 25 Jun 2002 12:13:32 -0400 (EDT)
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by postgresql.org (Postfix) with SMTP
+ id 06727476181; Tue, 25 Jun 2002 12:13:31 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP id AB1CB4760F7
+ for
; Tue, 25 Jun 2002 12:13:28 -0400 (EDT)
+Received: from bradm.net (208-59-250-198.c3-0.smr-ubr1.sbo-smr.ma.cable.rcn.com [208.59.250.198])
+ by postgresql.org (Postfix) with ESMTP id 594BD476083
+ for
; Tue, 25 Jun 2002 12:13:27 -0400 (EDT)
+Received: (from brad@localhost)
+ by bradm.net (8.11.6/8.11.6) id g5PGCjA14829;
+ Tue, 25 Jun 2002 12:12:45 -0400
+Date: Tue, 25 Jun 2002 12:12:45 -0400
+From: Bradley McLean
+To: Tom Lane
+cc: Mario Weilguni ,
+ Curt Sampson , "J. R. Nield" ,
+Subject: Re: [HACKERS] Buffer Management
+MIME-Version: 1.0
+Content-Type: text/plain; charset=us-ascii
+Content-Disposition: inline
+User-Agent: Mutt/1.2.5.1i
+Precedence: bulk
+X-Spam-Status: No, hits=-4.2 required=5.0
+ tests=IN_REP_TO,X_NOT_PRESENT,DOUBLE_CAPSWORD
+ version=2.30
+Status: OR
+
+>
+> msync can force not-yet-written changes down to disk. It does not
+> prevent the OS from choosing to write changes *before* you invoke msync.
+>
+> Our problem is that we want to enforce the write ordering "WAL before
+> data file". To do that, we write and fsync (or DSYNC, or something)
+> a WAL entry before we issue the write() against the data file. We
+> don't really care if the kernel delays the data file write beyond that
+> point, but we can be certain that the data file write did not occur
+> too early.
+>
+> msync is designed to ensure exactly the opposite constraint: it can
+> guarantee that no changes remain unwritten after time T, but it can't
+> guarantee that changes aren't written before time T.
+
+Okay, so instead of looking for constraints from the OS on the data file,
+use the constraints on the WAL file. It would work at the cost of a buffer
+copy? Er, maybe two:
+
+mmap the data file and WAL separately.
+Copy the data file page to the WAL mmap area.
+Modify the page.
+msync() the WAL.
+Copy the page to the data file mmap area.
+msync() or not the data file.
+
+(This is half baked, just thought I'd see if it stirred further thought).
+
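The steps above can be sketched roughly as follows, assuming two MAP_SHARED file mappings of one page each. This reads "Modify the page" as modifying the copy held in the WAL mapping, which is an interpretation of the list rather than something the message states. All names (`map_scratch`, `update_via_wal`, `mmap_wal_demo`) and the scratch file path are hypothetical.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create an unlinked scratch file and map it shared. */
static char *map_scratch(size_t len)
{
    char  path[] = "/tmp/mmap_wal_XXXXXX";
    int   fd = mkstemp(path);
    void *p = MAP_FAILED;

    if (fd < 0)
        return NULL;
    unlink(path);
    if (ftruncate(fd, (off_t) len) == 0)
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : (char *) p;
}

/*
 * Change one byte of a data page by way of the WAL mapping:
 * copy the page into the WAL area, modify the copy, msync() the WAL,
 * and only then copy the page into the data file mapping.
 * Returns 0 on success, -1 if the WAL could not be synced.
 */
int update_via_wal(char *data_page, char *wal_page, size_t page_len,
                   size_t off, char value)
{
    memcpy(wal_page, data_page, page_len);  /* data page -> WAL area */
    wal_page[off] = value;                  /* modify the copy */
    if (msync(wal_page, page_len, MS_SYNC) < 0)
        return -1;                          /* WAL not durable: stop here */
    memcpy(data_page, wal_page, page_len);  /* now dirty the data mapping */
    return 0;                               /* msync() of data is optional */
}

/* Exercise the scheme on two scratch mappings; returns 0 on success. */
int mmap_wal_demo(void)
{
    size_t len = (size_t) sysconf(_SC_PAGESIZE);
    char  *data = map_scratch(len);
    char  *wal = map_scratch(len);
    int    ok = 0;

    if (data != NULL && wal != NULL)
        ok = update_via_wal(data, wal, len, 10, 'Z') == 0 &&
             data[10] == 'Z' && wal[10] == 'Z';

    if (data != NULL)
        munmap(data, len);
    if (wal != NULL)
        munmap(wal, len);
    return ok ? 0 : -1;
}
```

The data mapping is never dirtied before msync() on the WAL mapping returns, which is the ordering the scheme is after; the price is the two page-sized copies the message mentions.
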
+As another approach, how expensive is re-MMAPing portions of the files
+compared to the copies.
+
+-Brad
+
+>
+> regards, tom lane
+>
+
+
+
+
+
+
+Return-path:
+Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
+ by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5Q4Dig27201
+ for
; Wed, 26 Jun 2002 00:13:45 -0400 (EDT)
+Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
+ by academic.cynic.net (Postfix) with ESMTP
+ id B95E5F820; Wed, 26 Jun 2002 04:13:45 +0000 (UTC)
+Date: Wed, 26 Jun 2002 13:13:42 +0900 (JST)
+From: Curt Sampson
+To: Tom Lane
+cc: "J. R. Nield"
, Bruce Momjian ,
+Subject: Re: [HACKERS] Buffer Management
+MIME-Version: 1.0
+Content-Type: TEXT/PLAIN; charset=US-ASCII
+Status: OR
+
+On Tue, 25 Jun 2002, Tom Lane wrote:
+
+> Curt Sampson writes:
+>
+> > I don't understand why there would be any loss of visibility of changes.
+> > If two backends mmap the same block of a file, and it's shared, that's
+> > the same block of physical memory that they're accessing.
+>
+> Is it? You have a mighty narrow conception of the range of
+> implementations that's possible for mmap.
+
+It's certainly possible to implement something that you call mmap
+that is not. But if you are using the posix-defined MAP_SHARED flag,
+the behaviour above is what you see. It might be implemented slightly
+differently internally, but that's no concern of postgres. And I find
+it pretty unlikely that it would be implemented otherwise without good
+reason.
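
A small sketch of the MAP_SHARED behaviour described above, assuming a
file descriptor opened read/write on a file of at least len bytes. Two
independent mmap() calls in one process stand in for two backends here;
the function name is hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* glibc: expose POSIX calls under strict -std= */
#endif
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Write one byte through a first MAP_SHARED mapping of the file and
 * read it back through a second, independent mapping of the same file.
 * Returns the byte seen through the second mapping, or -1 on error. */
int write_seen_through_second_mapping(int fd, size_t len, char value)
{
    assert(len > 0);

    char *a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED)
        return -1;
    char *b = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (b == MAP_FAILED) {
        munmap(a, len);
        return -1;
    }

    a[0] = value;                       /* store via the first mapping */
    int seen = (unsigned char) b[0];    /* load via the second mapping */

    munmap(b, len);
    munmap(a, len);
    return seen;
}
```

No msync() is called between the store and the load. The store is
visible through the second mapping because both mappings refer to the
same underlying pages.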
+
+Note that your proposal of using mmap to replace sysv shared memory
+relies on the behaviour I've described too. As well, if you're replacing
+sysv shared memory with an mmap'd file, you may end up doing excessive
+disk I/O on systems without the MAP_NOSYNC option. (Without this option,
+the update thread/daemon may ensure that every buffer is flushed to the
+backing store on disk every 30 seconds or so. You might be able to get
+around this by using a small file-backed area for things that need to
+persist after a crash, and a larger anonymous area for things that don't
+need to persist after a crash.)
+
+> But the main problem is that mmap doesn't let us control when changes to
+> the memory buffer will get reflected back to disk --- AFAICT, the OS is
+> free to do the write-back at any instant after you dirty the page, and
+> that completely breaks the WAL algorithm. (WAL = write AHEAD log;
+> the log entry describing a change must hit disk before the data page
+> change itself does.)
+
+Hm. Well, we could try not to write the data to the page until
+after we receive notification that our WAL data is committed to
+stable storage. However, the new data has to be available to all of
+the backends at the exact time that the commit happens. Perhaps a
+shared list of pending writes?
+
+Another option would be to just let it write, but on startup, scan
+all of the data blocks in the database for tuples that have a
+transaction ID later than the last one we updated to, and remove
+them. That could be pretty darn expensive on a large database, though.
+
+cjs
+--
+Curt Sampson +81 90 7737 2974 http://www.netbsd.org
+ Don't you know, in this new Dark Age, we're all light. --XTC
+
+
+Return-path:
+Received: from sss.pgh.pa.us (root@[192.204.191.242])
+ by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5QDM3g26028
+ for
; Wed, 26 Jun 2002 09:22:04 -0400 (EDT)
+Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
+ by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5QDLxv01699;
+ Wed, 26 Jun 2002 09:21:59 -0400 (EDT)
+To: Curt Sampson
+cc: "J. R. Nield"
, Bruce Momjian ,
+Subject: Re: [HACKERS] Buffer Management
+Comments: In-reply-to Curt Sampson
+ message dated "Wed, 26 Jun 2002 13:13:42 +0900"
+Date: Wed, 26 Jun 2002 09:21:59 -0400
+From: Tom Lane
+Status: ORr
+
+Curt Sampson writes:
+> Note that your proposal of using mmap to replace sysv shared memory
+> relies on the behaviour I've described too.
+
+True, but I was not envisioning mapping an actual file --- at least
+on HPUX, the only way to generate an arbitrary-sized shared memory
+region is to use MAP_ANONYMOUS and not have the mmap'd area connected
+to any file at all. It's not farfetched to think that this aspect
+of mmap might work differently from mapping pieces of actual files.
+
+In practice of course we'd have to restrict use of any such
+implementation to platforms where mmap behaves reasonably ... according
+to our definition of "reasonably".
+
+ regards, tom lane
+
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5QKEag03467
+ for
; Wed, 26 Jun 2002 16:14:36 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP
+ id B10E9476B4D; Wed, 26 Jun 2002 15:16:32 -0400 (EDT)
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by postgresql.org (Postfix) with SMTP
+ id 6635E476DC0; Wed, 26 Jun 2002 14:31:10 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP id 13F884765BD
+ for
; Wed, 26 Jun 2002 14:22:36 -0400 (EDT)
+Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
+ by postgresql.org (Postfix) with ESMTP id 3F02D476EB3
+ for
; Wed, 26 Jun 2002 13:11:37 -0400 (EDT)
+Received: (from pgman@localhost)
+ by candle.pha.pa.us (8.11.6/8.10.1) id g5QHBJM15565;
+ Wed, 26 Jun 2002 13:11:19 -0400 (EDT)
+Subject: Re: [HACKERS] Buffer Management
+To: Tom Lane
+Date: Wed, 26 Jun 2002 13:11:19 -0400 (EDT)
+cc: Curt Sampson , "J. R. Nield" ,
+X-Mailer: ELM [version 2.4ME+ PL97 (25)]
+MIME-Version: 1.0
+Content-Transfer-Encoding: 7bit
+Content-Type: text/plain; charset=US-ASCII
+Precedence: bulk
+X-Spam-Status: No, hits=-3.4 required=5.0
+ tests=IN_REP_TO
+ version=2.30
+Status: OR
+
+Tom Lane wrote:
+> Curt Sampson writes:
+> > Note that your proposal of using mmap to replace sysv shared memory
+> > relies on the behaviour I've described too.
+>
+> True, but I was not envisioning mapping an actual file --- at least
+> on HPUX, the only way to generate an arbitrary-sized shared memory
+> region is to use MAP_ANONYMOUS and not have the mmap'd area connected
+> to any file at all. It's not farfetched to think that this aspect
+> of mmap might work differently from mapping pieces of actual files.
+>
+> In practice of course we'd have to restrict use of any such
+> implementation to platforms where mmap behaves reasonably ... according
+> to our definition of "reasonably".
+
+Yes, I am told mapping /dev/zero is the same as the anon map.
+
+--
+ Bruce Momjian | http://candle.pha.pa.us
+ + If your life is a hard drive, | 830 Blythe Avenue
+ + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
+
+
+
+---------------------------(end of broadcast)---------------------------
+TIP 6: Have you searched our list archives?
+
+http://archives.postgresql.org
+
+
+
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5R3d9g02161
+ for
; Wed, 26 Jun 2002 23:39:09 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP
+ id 88BF4476287; Wed, 26 Jun 2002 23:38:56 -0400 (EDT)
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+ by postgresql.org (Postfix) with SMTP
+ id 3C069476954; Wed, 26 Jun 2002 23:38:17 -0400 (EDT)
+Received: from localhost.localdomain (postgresql.org [64.49.215.8])
+ by localhost (Postfix) with ESMTP id A0397476941
+ for
; Wed, 26 Jun 2002 23:38:12 -0400 (EDT)
+Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
+ by postgresql.org (Postfix) with ESMTP id 2AA24475C40
+ for
; Wed, 26 Jun 2002 23:37:18 -0400 (EDT)
+Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
+ by academic.cynic.net (Postfix) with ESMTP
+ id 179D5F822; Thu, 27 Jun 2002 03:37:20 +0000 (UTC)
+Date: Thu, 27 Jun 2002 12:37:18 +0900 (JST)
+From: Curt Sampson
+To: Tom Lane
+cc: "J. R. Nield"
, Bruce Momjian ,
+Subject: Re: [HACKERS] Buffer Management
+MIME-Version: 1.0
+Content-Type: TEXT/PLAIN; charset=US-ASCII
+Precedence: bulk
+X-Spam-Status: No, hits=-5.3 required=5.0
+ tests=IN_REP_TO,X_NOT_PRESENT
+ version=2.30
+Status: OR
+
+On Wed, 26 Jun 2002, Tom Lane wrote:
+
+> Curt Sampson writes:
+> > Note that your proposal of using mmap to replace sysv shared memory
+> > relies on the behaviour I've described too.
+>
+> True, but I was not envisioning mapping an actual file --- at least
+> on HPUX, the only way to generate an arbitrary-sized shared memory
+> region is to use MAP_ANONYMOUS and not have the mmap'd area connected
+> to any file at all. It's not farfetched to think that this aspect
+> of mmap might work differently from mapping pieces of actual files.
+
+I find it somewhat farfetched, for a couple of reasons:
+
+ 1. Memory mapped with the MAP_SHARED flag is shared memory,
+ anonymous or not. POSIX is pretty explicit about how this works,
+ and the "standard" for mmap that predates POSIX is the same.
+ Anonymous memory does not behave differently.
+
+ You could just as well say that some systems might exist such
+ that one process can write() a block to a file, and then another
+ might read() it afterwards but not see the changes. Postgres
+ should not try to deal with hypothetical systems that are so
+ completely broken.
+
+ 2. Mmap is implemented as part of a unified buffer cache system
+ on all of today's operating systems that I know of. The memory
+ is backed by swap space when anonymous, and by a specified file
+ when not anonymous; but the way these two are handled is
+ *exactly* the same internally.
+
+ Even on older systems without unified buffer cache, the behaviour
+ is the same between anonymous and file-backed mmap'd memory.
+ And there would be no point in making it otherwise. Mmap is
+ designed to let you share memory; why make a broken implementation
+ under certain circumstances?
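
A minimal sketch of point 1 for the anonymous case, assuming a platform
that provides MAP_ANONYMOUS (or the older spelling MAP_ANON). The
function name is hypothetical. A child process stores into a
MAP_SHARED anonymous region and the parent reads the value back after
the child exits.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* glibc: expose MAP_ANONYMOUS under strict -std= */
#endif
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON   /* older BSD-style name */
#endif

/* Returns the value the parent sees after the child's store,
 * or -1 on error. */
int child_write_visible_to_parent(void)
{
    int *shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    *shared = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(shared, sizeof *shared);
        return -1;
    }
    if (pid == 0) {
        *shared = 42;   /* child: store into the shared region */
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0) {
        munmap(shared, sizeof *shared);
        return -1;
    }

    int seen = *shared; /* parent: read what the child stored */
    munmap(shared, sizeof *shared);
    return seen;
}
```

With MAP_PRIVATE in place of MAP_SHARED the child's store would go to
its own copy-on-write page and the parent would still read 0.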
+
+> In practice of course we'd have to restrict use of any such
+> implementation to platforms where mmap behaves reasonably ... according
+> to our definition of "reasonably".
+
+Of course. As we do already with regular I/O.
+
+cjs
+--
+Curt Sampson +81 90 7737 2974 http://www.netbsd.org
+ Don't you know, in this new Dark Age, we're all light. --XTC
+
+
+
+
+---------------------------(end of broadcast)---------------------------
+TIP 3: if posting/reading through Usenet, please send an appropriate
+subscribe-nomail command to majordomo@postgresql.org so that your
+message can get through to the mailing list cleanly
+
+
+