Previously there was no memory barrier enforcing correct memory ordering when
waiting for a free IO handle. However, in the much more common case of waiting
for IO to complete, the necessary memory barriers were already present.
On strongly ordered architectures like x86 this had no negative consequences,
but on some armv8 machines (observed on Apple hardware), it was possible for
the IO worker's update to PgAioHandle->state to become visible before its
update to ->distilled_result, leading to rather confusing assertion failures.
The failures were rare enough that the bug sometimes took days to reproduce
when running 027_stream_regress in a loop.
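
Schematically, the race boils down to something like this (a reduced, hedged
illustration; the struct and function names are invented, not the actual AIO
code):

    /* Reduced illustration of the reordering; not the actual AIO code. */
    typedef struct SketchHandle
    {
        int     state;          /* stand-in for PgAioHandle->state */
        int     result;         /* stand-in for ->distilled_result */
    } SketchHandle;

    /* IO worker: store the result, then mark the IO as complete */
    static void
    sketch_complete(SketchHandle *ioh)
    {
        ioh->result = 42;
        /*
         * Without pg_write_barrier() here, the two stores may become
         * visible in the opposite order on weakly ordered hardware.
         */
        ioh->state = 1;
    }

    /*
     * Waiter: observing state == 1 does not, by itself, order later loads.
     * (A real waiter uses the AIO wait machinery, not a raw spin.)
     */
    static int
    sketch_wait(SketchHandle *ioh)
    {
        while (ioh->state != 1)
            ;
        /* without pg_read_barrier() here, this load may return stale data */
        return ioh->result;
    }

On x86 the hardware already preserves the store/store and load/load ordering
relied on here; on armv8 it does not, hence the barriers.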
Once finally debugged, it was easy enough to come up with a much quicker
repro: trigger a lot of very fast IO by limiting io_combine_limit to 1, and
ensure that we always have to wait for a free handle by setting
io_max_concurrency to 1. Lots of concurrent seqscans in that setup reproduce
the issue within seconds.
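
For reference, that repro amounts to something like the following (the two GUC
values are from the description above; io_method = worker, the table, and the
pgbench parameters are illustrative assumptions):

    # postgresql.conf -- io_combine_limit/io_max_concurrency per the repro;
    # io_method = worker is an assumption matching the IO-worker scenario
    io_method = worker
    io_combine_limit = 1
    io_max_concurrency = 1

    # after a restart, hammer the server with concurrent seqscans,
    # assuming a large table "t" exists:
    echo 'SELECT count(*) FROM t;' > seqscan.sql
    pgbench -n -c 32 -j 8 -T 60 -f seqscan.sql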
One reason this was hard to debug was that the assertion failure most commonly
happened in WaitReadBuffers(), rather than in the AIO subsystem itself. The
assertions added in this commit make problems like this easier to understand.
Also add a comment to the IO worker explaining that we rely on the lwlock
acquisition for correct memory ordering.
I think it'd be good to add a TAP test that stress-tests buffer IO, but that's
material for a separate patch.
Thanks a lot to Alexander and Konstantin for all the debugging help.
Reported-by: Tom Lane
Reported-by: Alexander Lakhin
Investigated-by: Andres Freund
Investigated-by: Alexander Lakhin
Investigated-by: Konstantin Knizhnik
Discussion: https://postgr.es/m/2dkz7azclpeiqcmouamdixyn5xhlzy4rvikxrbovyzvi6rnv5c@pz7o7osv2ahf
pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state)
{
*state = ioh->state;
+
+ /*
+ * Ensure that we don't see an earlier state of the handle than ioh->state
+ * due to compiler or CPU reordering. This protects both ->generation as
+ * directly used here, and other fields in the handle accessed in the
+ * caller if the handle was not reused.
+ */
pg_read_barrier();
return ioh->generation != ref_generation;
* Note that no interrupts are processed between the state check
* and the call to reclaim - that's important as otherwise an
* interrupt could have already reclaimed the handle.
+ *
+ * Need to ensure that there's no reordering here. In the more
+ * common paths, where we wait for IO, that's done by
+ * pgaio_io_was_recycled().
*/
+ pg_read_barrier();
pgaio_io_reclaim(ioh);
reclaimed++;
}
* check and the call to reclaim - that's important as
* otherwise an interrupt could have already reclaimed the
* handle.
+ *
+ * Need to ensure that there's no reordering here. In the more
+ * common paths, where we wait for IO, that's done by
+ * pgaio_io_was_recycled().
*/
+ pg_read_barrier();
pgaio_io_reclaim(ioh);
break;
}
pgaio_result_status_string(result.status),
result.id, result.error_data, result.result);
result = ce->cb->complete_shared(ioh, result, cb_data);
+
+ /* the callback should never transition to unknown */
+ Assert(result.status != PGAIO_RS_UNKNOWN);
}
ioh->distilled_result = result;
/* start with distilled result from shared callback */
result = ioh->distilled_result;
+ Assert(result.status != PGAIO_RS_UNKNOWN);
for (int i = ioh->num_callbacks; i > 0; i--)
{
pgaio_result_status_string(result.status),
result.id, result.error_data, result.result);
result = ce->cb->complete_local(ioh, result, cb_data);
+
+ /* the callback should never transition to unknown */
+ Assert(result.status != PGAIO_RS_UNKNOWN);
}
/*
int nwakeups = 0;
int worker;
- /* Try to get a job to do. */
+ /*
+ * Try to get a job to do.
+ *
+ * The lwlock acquisition also provides the necessary memory barrier
+ * to ensure that we don't see outdated data in the handle.
+ */
LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
if ((io_index = pgaio_worker_submission_queue_consume()) == UINT32_MAX)
{