From 7b1607e28b78001bceab135ca428feae6b4ef27c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Thu, 7 Apr 2016 11:40:19 -0700
Subject: [PATCH 3/5] Cap the number of tapes used by external sorts

Commit df700e6b set merge order based on available buffer space (the
number of tapes was as high as possible while still allowing at least
32 * BLCKSZ of buffer space per tape), rejecting Knuth's theoretically
justified "sweet spot" of 7 tapes (a merge order of 6 -- Knuth's P).
This improved performance whenever it allowed a sort to complete in a
single pass.  However, it remains true that increasing the number of
tapes past 7 is unlikely to help once the amount of data to be sorted
significantly exceeds available memory; that commit probably mostly
improved matters where it enabled all merging to be done in a final
on-the-fly merge.

One problem with the merge order logic established by that commit is
that, with large work_mem settings and data volumes, the tapes could
waste as much as 8% of the available memory budget: tens of thousands
of tapes could be logically allocated for a sort that would only
benefit from a few dozen.

A new quasi-arbitrary cap of 501 is applied to the number of tapes
that tuplesort will ever use (i.e. merge order is capped at 500,
inclusive).  This is a conservative estimate of the number of runs at
which doing all merging on-the-fly no longer allows greater
overlapping of I/O and computation.
---
 src/backend/utils/sort/tuplesort.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 9313b87..ef9de01 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -108,7 +108,7 @@
  * code we determine the number of tapes M on the basis of workMem: we want
  * workMem/M to be large enough that we read a fair amount of data each time
  * we preread from a tape, so as to maintain the locality of access described
- * above.  Nonetheless, with large workMem we can have many tapes.
+ * above.  Nonetheless, with large workMem we can have a few hundred tapes.
  *
  *
  * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
@@ -230,6 +230,7 @@ typedef enum
  * tape during a preread cycle (see discussion at top of file).
  */
 #define MINORDER		6		/* minimum merge order */
+#define MAXORDER		500		/* maximum merge order */
 #define TAPE_BUFFER_OVERHEAD		(BLCKSZ * 3)
 #define MERGE_BUFFER_SIZE		(BLCKSZ * 32)
 
@@ -2250,8 +2251,22 @@ tuplesort_merge_order(int64 allowedMem)
 	mOrder = (allowedMem - TAPE_BUFFER_OVERHEAD) /
 		(MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD);
 
-	/* Even in minimum memory, use at least a MINORDER merge */
+	/*
+	 * Even in minimum memory, use at least a MINORDER merge.  Also cap the
+	 * maximum merge order at MAXORDER.
+	 *
+	 * When allowedMem is significantly lower than what is required for an
+	 * internal sort, it is unlikely that there are benefits to increasing
+	 * the number of tapes beyond Knuth's "sweet spot" of 7.  Furthermore,
+	 * in the common case where there turn out to be fewer than MAXORDER
+	 * initial runs, the per-tape overhead without a cap could be
+	 * significant with high allowedMem.  Significantly more tapes can be
+	 * useful if they enable doing all merging on-the-fly, but the merge
+	 * heap is rather cache inefficient if there are too many tapes (with
+	 * one run each); multiple passes seem preferable.
+	 */
 	mOrder = Max(mOrder, MINORDER);
+	mOrder = Min(mOrder, MAXORDER);
 
 	return mOrder;
 }
-- 
1.9.1
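
P.S. For anyone who wants to experiment with the clamping behavior,
here is a minimal standalone C sketch of the capped calculation from
the patch.  The constants are copied from tuplesort.c (BLCKSZ assumes
the default 8192-byte block size); the Max/Min macros, the use of
int64_t in place of PostgreSQL's int64, and the sample work_mem values
in main() are illustrative assumptions, not part of the patch:

	#include <stdio.h>
	#include <stdint.h>

	#define BLCKSZ			8192	/* default PostgreSQL block size */
	#define MINORDER		6	/* minimum merge order */
	#define MAXORDER		500	/* maximum merge order */
	#define TAPE_BUFFER_OVERHEAD	(BLCKSZ * 3)
	#define MERGE_BUFFER_SIZE	(BLCKSZ * 32)

	#define Max(x, y)	((x) > (y) ? (x) : (y))
	#define Min(x, y)	((x) < (y) ? (x) : (y))

	static int
	tuplesort_merge_order(int64_t allowedMem)
	{
		int		mOrder;

		mOrder = (allowedMem - TAPE_BUFFER_OVERHEAD) /
			(MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD);

		/* Clamp to [MINORDER, MAXORDER], as in the patch */
		mOrder = Max(mOrder, MINORDER);
		mOrder = Min(mOrder, MAXORDER);

		return mOrder;
	}

	int
	main(void)
	{
		/* sample work_mem settings of 1MB, 64MB and 1GB, in bytes */
		int64_t		samples[] = {(int64_t) 1 << 20,
					     (int64_t) 64 << 20,
					     (int64_t) 1 << 30};
		int		i;

		for (i = 0; i < 3; i++)
			printf("workMem = %lld -> merge order %d\n",
			       (long long) samples[i],
			       tuplesort_merge_order(samples[i]));
		return 0;
	}

With these numbers, 1MB is clamped up to the MINORDER of 6, 64MB
computes an uncapped merge order of 233, and 1GB hits the new MAXORDER
cap of 500.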