Make locale-dependent regex character classes work for large char codes.

author Tom Lane

Mon, 5 Sep 2016 21:06:29 +0000 (17:06 -0400)

committer Tom Lane

Mon, 5 Sep 2016 21:06:29 +0000 (17:06 -0400)
author Tom Lane
Mon, 5 Sep 2016 21:06:29 +0000 (17:06 -0400)
committer Tom Lane
Mon, 5 Sep 2016 21:06:29 +0000 (17:06 -0400)
diff --git a/src/backend/regex/README b/src/backend/regex/README

index 6c9f48315e3a2be030288d98db53427b6016db78..f08aab69e376f6e7bc0df998d2fcd20fde4a45ef 100644 (file)
--- a/src/backend/regex/README
+++ b/src/backend/regex/README
@@ -27,13 +27,14 @@ and similarly additional source files rege_*.c that are #include'd in
  regexec.  This was done to avoid exposing internal symbols globally;
  all functions not meant to be part of the library API are static.
  
-(Actually the above is a lie in one respect: there is one more global
-symbol, pg_set_regex_collation in regcomp.  It is not meant to be part of
-the API, but it has to be global because both regcomp and regexec call it.
-It'd be better to get rid of that, as well as the static variables it
-sets, in favor of keeping the needed locale state in the regex structs.
-We have not done this yet for lack of a design for how to add
-application-specific state to the structs.)
+(Actually the above is a lie in one respect: there are two more global
+symbols, pg_set_regex_collation and pg_reg_getcolor in regcomp.  These are
+not meant to be part of the API, but they have to be global because both
+regcomp and regexec call them.  It'd be better to get rid of
+pg_set_regex_collation, as well as the static variables it sets, in favor of
+keeping the needed locale state in the regex structs.  We have not done this
+yet for lack of a design for how to add application-specific state to the
+structs.)
  
  What's where in src/backend/regex/:
  
@@ -274,28 +275,65 @@ colors:
    an existing color has to be subdivided.
  
  The last two of these are handled with the "struct colordesc" array and
-the "colorchain" links in NFA arc structs.  The color map proper (that
-is, the per-character lookup array) is handled as a multi-level tree,
-with each tree level indexed by one byte of a character's value.  The
-code arranges to not have more than one copy of bottom-level tree pages
-that are all-the-same-color.
-
-Unfortunately, this design does not seem terribly efficient for common
-cases such as a tree in which all Unicode letters are colored the same,
-because there aren't that many places where we get a whole page all the
-same color, except at the end of the map.  (It also strikes me that given
-PG's current restrictions on the range of Unicode values, we could use a
-3-level rather than 4-level tree; but there's not provision for that in
-regguts.h at the moment.)
-
-A bigger problem is that it just doesn't seem very reasonable to have to
-consider each Unicode letter separately at regex parse time for a regex
-such as "\w"; more than likely, a huge percentage of those codes will
-never be seen at runtime.  We need to fix things so that locale-based
-character classes are somehow processed "symbolically" without making a
-full expansion of their contents at parse time.  This would mean that we'd
-have to be ready to call iswalpha() at runtime, but if that only happens
-for high-code-value characters, it shouldn't be a big performance hit.
+the "colorchain" links in NFA arc structs.
+
+Ideally, we'd do the first two operations using a simple linear array
+storing the current color assignment for each character code.
+Unfortunately, that's not terribly workable for large charsets such as
+Unicode.  Our solution is to divide the color map into two parts.  A simple
+linear array is used for character codes up to MAX_SIMPLE_CHR, which can be
+chosen large enough to include all popular characters (so that the
+significantly-slower code paths about to be described are seldom invoked).
+Characters above that need be considered at compile time only if they
+appear explicitly in the regex pattern.  We store each such mentioned
+character or character range as an entry in the "colormaprange" array in
+the colormap.  (Overlapping ranges are split into unique subranges, so that
+each range in the finished list needs only a single color that describes
+all its characters.)  When mapping a character above MAX_SIMPLE_CHR to a
+color at runtime, we search this list of ranges explicitly.
+
+That's still not quite enough, though, because of locale-dependent
+character classes such as [[:alpha:]].  In Unicode locales these classes
+may have thousands of entries that are above MAX_SIMPLE_CHR, and we
+certainly don't want to be searching large colormaprange arrays at runtime.
+Nor do we even want to spend the time to initialize cvec structures that
+exhaustively describe all of those characters.  Our solution is to compute
+exact per-character colors at regex compile time only up to MAX_SIMPLE_CHR.
+For characters above that, we apply the  or  lookup
+functions at runtime for each locale-dependent character class used in the
+regex pattern, constructing a bitmap that describes which classes the
+runtime character belongs to.  The per-character-range data structure
+mentioned above actually holds, for each range, a separate color entry
+for each possible combination of character class properties.  That is,
+the color map for characters above MAX_SIMPLE_CHR is really a 2-D array,
+whose rows correspond to high characters or character ranges that are
+explicitly mentioned in the regex pattern, and whose columns correspond
+to sets of the locale-dependent character classes that are used in the
+regex.
+
+As an example, given the pattern '\w\u1234[\U0001D100-\U0001D1FF]'
+(and supposing that MAX_SIMPLE_CHR is less than 0x1234), we will need
+a high color map with three rows.  One row is for the single character
+U+1234 (represented as a single-element range), one is for the range
+U+1D100..U+1D1FF, and the other row represents all remaining high
+characters.  The color map has two columns, one for characters that
+satisfy iswalnum() and one for those that don't.
+
+We build this color map in parallel with scanning the regex.  Each time
+we detect a new explicit high character (or range) or a locale-dependent
+character class, we split existing entry(s) in the high color map so that
+characters we need to be able to distinguish will have distinct entries
+that can be given separate colors.  Often, though, single entries in the
+high color map will represent very large sets of characters.
+
+If there are both explicit high characters/ranges and locale-dependent
+character classes, we may have entries in the high color map array that
+have non-WHITE colors but don't actually represent any real characters.
+(For example, in a row representing a singleton range, only one of the
+columns could possibly be a live entry; it's the one matching the actual
+locale properties for that single character.)  We don't currently make
+any effort to reclaim such colors.  In principle it could be done, but
+it's not clear that it's worth the trouble.
  
  
  Detailed semantics of an NFA
diff --git a/src/backend/regex/regc_color.c b/src/backend/regex/regc_color.c

index 8ffc8fb797d0e5340e4c968be2eeabb9af66250b..4b72894ad3a60bc33c490f2fc78bccdda6369160 100644 (file)
--- a/src/backend/regex/regc_color.c
+++ b/src/backend/regex/regc_color.c
@@ -49,10 +49,6 @@ static void
  initcm(struct vars * v,
        struct colormap * cm)
  {
-   int         i;
-   int         j;
-   union tree *t;
-   union tree *nextt;
     struct colordesc *cd;
  
     cm->magic = CMMAGIC;
@@ -64,24 +60,40 @@ initcm(struct vars * v,
     cm->free = 0;
  
     cd = cm->cd;                /* cm->cd[WHITE] */
+   cd->nschrs = MAX_SIMPLE_CHR - CHR_MIN + 1;
+   cd->nuchrs = 1;
     cd->sub = NOSUB;
     cd->arcs = NULL;
     cd->firstchr = CHR_MIN;
-   cd->nchrs = CHR_MAX - CHR_MIN + 1;
     cd->flags = 0;
  
-   /* upper levels of tree */
-   for (t = &cm->tree[0], j = NBYTS - 1; j > 0; t = nextt, j--)
+   cm->locolormap = (color *)
+       MALLOC((MAX_SIMPLE_CHR - CHR_MIN + 1) * sizeof(color));
+   if (cm->locolormap == NULL)
     {
-       nextt = t + 1;
-       for (i = BYTTAB - 1; i >= 0; i--)
-           t->tptr[i] = nextt;
+       CERR(REG_ESPACE);
+       cm->cmranges = NULL;    /* prevent failure during freecm */
+       cm->hicolormap = NULL;
+       return;
     }
-   /* bottom level is solid white */
-   t = &cm->tree[NBYTS - 1];
-   for (i = BYTTAB - 1; i >= 0; i--)
-       t->tcolor[i] = WHITE;
-   cd->block = t;
+   /* this memset relies on WHITE being zero: */
+   memset(cm->locolormap, WHITE,
+          (MAX_SIMPLE_CHR - CHR_MIN + 1) * sizeof(color));
+
+   memset(cm->classbits, 0, sizeof(cm->classbits));
+   cm->numcmranges = 0;
+   cm->cmranges = NULL;
+   cm->maxarrayrows = 4;       /* arbitrary initial allocation */
+   cm->hiarrayrows = 1;        /* but we have only one row/col initially */
+   cm->hiarraycols = 1;
+   cm->hicolormap = (color *) MALLOC(cm->maxarrayrows * sizeof(color));
+   if (cm->hicolormap == NULL)
+   {
+       CERR(REG_ESPACE);
+       return;
+   }
+   /* initialize the "all other characters" row to WHITE */
+   cm->hicolormap[0] = WHITE;
  }
  
  /*
@@ -90,117 +102,67 @@ initcm(struct vars * v,
  static void
  freecm(struct colormap * cm)
  {
-   size_t      i;
-   union tree *cb;
-
     cm->magic = 0;
-   if (NBYTS > 1)
-       cmtreefree(cm, cm->tree, 0);
-   for (i = 1; i <= cm->max; i++)      /* skip WHITE */
-       if (!UNUSEDCOLOR(&cm->cd[i]))
-       {
-           cb = cm->cd[i].block;
-           if (cb != NULL)
-               FREE(cb);
-       }
     if (cm->cd != cm->cdspace)
         FREE(cm->cd);
+   if (cm->locolormap != NULL)
+       FREE(cm->locolormap);
+   if (cm->cmranges != NULL)
+       FREE(cm->cmranges);
+   if (cm->hicolormap != NULL)
+       FREE(cm->hicolormap);
  }
  
  /*
- * cmtreefree - free a non-terminal part of a colormap tree
+ * pg_reg_getcolor - slow case of GETCOLOR()
   */
-static void
-cmtreefree(struct colormap * cm,
-          union tree * tree,
-          int level)           /* level number (top == 0) of this block */
+color
+pg_reg_getcolor(struct colormap * cm, chr c)
  {
-   int         i;
-   union tree *t;
-   union tree *fillt = &cm->tree[level + 1];
-   union tree *cb;
-
-   assert(level < NBYTS - 1);  /* this level has pointers */
-   for (i = BYTTAB - 1; i >= 0; i--)
+   int         rownum,
+               colnum,
+               low,
+               high;
+
+   /* Should not be used for chrs in the locolormap */
+   assert(c > MAX_SIMPLE_CHR);
+
+   /*
+    * Find which row it's in.  The colormapranges are in order, so we can use
+    * binary search.
+    */
+   rownum = 0;                 /* if no match, use array row zero */
+   low = 0;
+   high = cm->numcmranges;
+   while (low < high)
     {
-       t = tree->tptr[i];
-       assert(t != NULL);
-       if (t != fillt)
+       int         middle = low + (high - low) / 2;
+       const colormaprange *cmr = &cm->cmranges[middle];
+
+       if (c < cmr->cmin)
+           high = middle;
+       else if (c > cmr->cmax)
+           low = middle + 1;
+       else
         {
-           if (level < NBYTS - 2)
-           {                   /* more pointer blocks below */
-               cmtreefree(cm, t, level + 1);
-               FREE(t);
-           }
-           else
-           {                   /* color block below */
-               cb = cm->cd[t->tcolor[0]].block;
-               if (t != cb)    /* not a solid block */
-                   FREE(t);
-           }
+           rownum = cmr->rownum;       /* found a match */
+           break;
         }
     }
-}
  
-/*
- * setcolor - set the color of a character in a colormap
- */
-static color                   /* previous color */
-setcolor(struct colormap * cm,
-        chr c,
-        color co)
-{
-   uchr        uc = c;
-   int         shift;
-   int         level;
-   int         b;
-   int         bottom;
-   union tree *t;
-   union tree *newt;
-   union tree *fillt;
-   union tree *lastt;
-   union tree *cb;
-   color       prev;
-
-   assert(cm->magic == CMMAGIC);
-   if (CISERR() || co == COLORLESS)
-       return COLORLESS;
-
-   t = cm->tree;
-   for (level = 0, shift = BYTBITS * (NBYTS - 1); shift > 0;
-        level++, shift -= BYTBITS)
-   {
-       b = (uc >> shift) & BYTMASK;
-       lastt = t;
-       t = lastt->tptr[b];
-       assert(t != NULL);
-       fillt = &cm->tree[level + 1];
-       bottom = (shift <= BYTBITS) ? 1 : 0;
-       cb = (bottom) ? cm->cd[t->tcolor[0]].block : fillt;
-       if (t == fillt || t == cb)
-       {                       /* must allocate a new block */
-           newt = (union tree *) MALLOC((bottom) ?
-                               sizeof(struct colors) : sizeof(struct ptrs));
-           if (newt == NULL)
-           {
-               CERR(REG_ESPACE);
-               return COLORLESS;
-           }
-           if (bottom)
-               memcpy(VS(newt->tcolor), VS(t->tcolor),
-                      BYTTAB * sizeof(color));
-           else
-               memcpy(VS(newt->tptr), VS(t->tptr),
-                      BYTTAB * sizeof(union tree *));
-           t = newt;
-           lastt->tptr[b] = t;
-       }
+   /*
+    * Find which column it's in --- this is all locale-dependent.
+    */
+   if (cm->hiarraycols > 1)
+   {
+       colnum = cclass_column_index(cm, c);
+       return cm->hicolormap[rownum * cm->hiarraycols + colnum];
+   }
+   else
+   {
+       /* fast path if no relevant cclasses */
+       return cm->hicolormap[rownum];
     }
-
-   b = uc & BYTMASK;
-   prev = t->tcolor[b];
-   t->tcolor[b] = co;
-   return prev;
  }
  
  /*
@@ -216,7 +178,7 @@ maxcolor(struct colormap * cm)
  }
  
  /*
- * newcolor - find a new color (must be subject of setcolor at once)
+ * newcolor - find a new color (must be assigned at once)
   * Beware: may relocate the colordescs.
   */
  static color                   /* COLORLESS for error */
@@ -278,12 +240,12 @@ newcolor(struct colormap * cm)
         cd = &cm->cd[cm->max];
     }
  
-   cd->nchrs = 0;
+   cd->nschrs = 0;
+   cd->nuchrs = 0;
     cd->sub = NOSUB;
     cd->arcs = NULL;
     cd->firstchr = CHR_MIN;     /* in case never set otherwise */
     cd->flags = 0;
-   cd->block = NULL;
  
     return (color) (cd - cm->cd);
  }
@@ -305,13 +267,9 @@ freecolor(struct colormap * cm,
  
     assert(cd->arcs == NULL);
     assert(cd->sub == NOSUB);
-   assert(cd->nchrs == 0);
+   assert(cd->nschrs == 0);
+   assert(cd->nuchrs == 0);
     cd->flags = FREECOL;
-   if (cd->block != NULL)
-   {
-       FREE(cd->block);
-       cd->block = NULL;       /* just paranoia */
-   }
  
     if ((size_t) co == cm->max)
     {
@@ -354,17 +312,25 @@ static color
  pseudocolor(struct colormap * cm)
  {
     color       co;
+   struct colordesc *cd;
  
     co = newcolor(cm);
     if (CISERR())
         return COLORLESS;
-   cm->cd[co].nchrs = 1;
-   cm->cd[co].flags = PSEUDO;
+   cd = &cm->cd[co];
+   cd->nschrs = 0;
+   cd->nuchrs = 1;             /* pretend it is in the upper map */
+   cd->sub = NOSUB;
+   cd->arcs = NULL;
+   cd->firstchr = CHR_MIN;
+   cd->flags = PSEUDO;
     return co;
  }
  
  /*
   * subcolor - allocate a new subcolor (if necessary) to this chr
+ *
+ * This works only for chrs that map into the low color map.
   */
  static color
  subcolor(struct colormap * cm, chr c)
@@ -372,7 +338,9 @@ subcolor(struct colormap * cm, chr c)
     color       co;             /* current color of c */
     color       sco;            /* new subcolor */
  
-   co = GETCOLOR(cm, c);
+   assert(c <= MAX_SIMPLE_CHR);
+
+   co = cm->locolormap[c - CHR_MIN];
     sco = newsub(cm, co);
     if (CISERR())
         return COLORLESS;
@@ -380,11 +348,37 @@ subcolor(struct colormap * cm, chr c)
  
     if (co == sco)              /* already in an open subcolor */
         return co;              /* rest is redundant */
-   cm->cd[co].nchrs--;
-   if (cm->cd[sco].nchrs == 0)
+   cm->cd[co].nschrs--;
+   if (cm->cd[sco].nschrs == 0)
         cm->cd[sco].firstchr = c;
-   cm->cd[sco].nchrs++;
-   setcolor(cm, c, sco);
+   cm->cd[sco].nschrs++;
+   cm->locolormap[c - CHR_MIN] = sco;
+   return sco;
+}
+
+/*
+ * subcolorhi - allocate a new subcolor (if necessary) to this colormap entry
+ *
+ * This is the same processing as subcolor(), but for entries in the high
+ * colormap, which do not necessarily correspond to exactly one chr code.
+ */
+static color
+subcolorhi(struct colormap * cm, color *pco)
+{
+   color       co;             /* current color of entry */
+   color       sco;            /* new subcolor */
+
+   co = *pco;
+   sco = newsub(cm, co);
+   if (CISERR())
+       return COLORLESS;
+   assert(sco != COLORLESS);
+
+   if (co == sco)              /* already in an open subcolor */
+       return co;              /* rest is redundant */
+   cm->cd[co].nuchrs--;
+   cm->cd[sco].nuchrs++;
+   *pco = sco;
     return sco;
  }
  
@@ -400,7 +394,8 @@ newsub(struct colormap * cm,
     sco = cm->cd[co].sub;
     if (sco == NOSUB)
     {                           /* color has no open subcolor */
-       if (cm->cd[co].nchrs == 1)      /* optimization */
+       /* optimization: singly-referenced color need not be subcolored */
+       if ((cm->cd[co].nschrs + cm->cd[co].nuchrs) == 1)
             return co;
         sco = newcolor(cm);     /* must create subcolor */
         if (sco == COLORLESS)
@@ -417,136 +412,500 @@ newsub(struct colormap * cm,
  }
  
  /*
- * subrange - allocate new subcolors to this range of chrs, fill in arcs
+ * newhicolorrow - get a new row in the hicolormap, cloning it from oldrow
+ *
+ * Returns array index of new row.  Note the array might move.
   */
-static void
-subrange(struct vars * v,
-        chr from,
-        chr to,
-        struct state * lp,
-        struct state * rp)
+static int
+newhicolorrow(struct colormap * cm,
+             int oldrow)
  {
-   uchr        uf;
+   int         newrow = cm->hiarrayrows;
+   color      *newrowptr;
     int         i;
  
-   assert(from <= to);
+   /* Assign a fresh array row index, enlarging storage if needed */
+   if (newrow >= cm->maxarrayrows)
+   {
+       color      *newarray;
+
+       if (cm->maxarrayrows >= INT_MAX / (cm->hiarraycols * 2))
+       {
+           CERR(REG_ESPACE);
+           return 0;
+       }
+       newarray = (color *) REALLOC(cm->hicolormap,
+                                    cm->maxarrayrows * 2 *
+                                    cm->hiarraycols * sizeof(color));
+       if (newarray == NULL)
+       {
+           CERR(REG_ESPACE);
+           return 0;
+       }
+       cm->hicolormap = newarray;
+       cm->maxarrayrows *= 2;
+   }
+   cm->hiarrayrows++;
+
+   /* Copy old row data */
+   newrowptr = &cm->hicolormap[newrow * cm->hiarraycols];
+   memcpy(newrowptr,
+          &cm->hicolormap[oldrow * cm->hiarraycols],
+          cm->hiarraycols * sizeof(color));
+
+   /* Increase color reference counts to reflect new colormap entries */
+   for (i = 0; i < cm->hiarraycols; i++)
+       cm->cd[newrowptr[i]].nuchrs++;
  
-   /* first, align "from" on a tree-block boundary */
-   uf = (uchr) from;
-   i = (int) (((uf + BYTTAB - 1) & (uchr) ~BYTMASK) - uf);
-   for (; from <= to && i > 0; i--, from++)
-       newarc(v->nfa, PLAIN, subcolor(v->cm, from), lp, rp);
-   if (from > to)              /* didn't reach a boundary */
+   return newrow;
+}
+
+/*
+ * newhicolorcols - create a new set of columns in the high colormap
+ *
+ * Essentially, extends the 2-D array to the right with a copy of itself.
+ */
+static void
+newhicolorcols(struct colormap * cm)
+{
+   color      *newarray;
+   int         r,
+               c;
+
+   if (cm->hiarraycols >= INT_MAX / (cm->maxarrayrows * 2))
+   {
+       CERR(REG_ESPACE);
         return;
+   }
+   newarray = (color *) REALLOC(cm->hicolormap,
+                                cm->maxarrayrows *
+                                cm->hiarraycols * 2 * sizeof(color));
+   if (newarray == NULL)
+   {
+       CERR(REG_ESPACE);
+       return;
+   }
+   cm->hicolormap = newarray;
  
-   /* deal with whole blocks */
-   for (; to - from >= BYTTAB; from += BYTTAB)
-       subblock(v, from, lp, rp);
+   /* Duplicate existing columns to the right, and increase ref counts */
+   /* Must work backwards in the array because we realloc'd in place */
+   for (r = cm->hiarrayrows - 1; r >= 0; r--)
+   {
+       color      *oldrowptr = &newarray[r * cm->hiarraycols];
+       color      *newrowptr = &newarray[r * cm->hiarraycols * 2];
+       color      *newrowptr2 = newrowptr + cm->hiarraycols;
+
+       for (c = 0; c < cm->hiarraycols; c++)
+       {
+           color       co = oldrowptr[c];
  
-   /* clean up any remaining partial table */
-   for (; from <= to; from++)
-       newarc(v->nfa, PLAIN, subcolor(v->cm, from), lp, rp);
+           newrowptr[c] = newrowptr2[c] = co;
+           cm->cd[co].nuchrs++;
+       }
+   }
+
+   cm->hiarraycols *= 2;
  }
  
  /*
- * subblock - allocate new subcolors for one tree block of chrs, fill in arcs
+ * subcolorcvec - allocate new subcolors to cvec members, fill in arcs
   *
- * Note: subcolors that are created during execution of this function
- * will not be given a useful value of firstchr; it'll be left as CHR_MIN.
- * For the current usage of firstchr in pg_regprefix, this does not matter
- * because such subcolors won't occur in the common prefix of a regex.
+ * For each chr "c" represented by the cvec, do the equivalent of
+ * newarc(v->nfa, PLAIN, subcolor(v->cm, c), lp, rp);
+ *
+ * Note that in typical cases, many of the subcolors are the same.
+ * While newarc() would discard duplicate arc requests, we can save
+ * some cycles by not calling it repetitively to begin with.  This is
+ * mechanized with the "lastsubcolor" state variable.
   */
  static void
-subblock(struct vars * v,
-        chr start,             /* first of BYTTAB chrs */
-        struct state * lp,
-        struct state * rp)
+subcolorcvec(struct vars * v,
+            struct cvec * cv,
+            struct state * lp,
+            struct state * rp)
  {
-   uchr        uc = start;
     struct colormap *cm = v->cm;
-   int         shift;
-   int         level;
+   color       lastsubcolor = COLORLESS;
+   chr         ch,
+               from,
+               to;
+   const chr  *p;
     int         i;
-   int         b;
-   union tree *t;
-   union tree *cb;
-   union tree *fillt;
-   union tree *lastt;
-   int         previ;
-   int         ndone;
-   color       co;
-   color       sco;
  
-   assert((uc % BYTTAB) == 0);
-
-   /* find its color block, making new pointer blocks as needed */
-   t = cm->tree;
-   fillt = NULL;
-   for (level = 0, shift = BYTBITS * (NBYTS - 1); shift > 0;
-        level++, shift -= BYTBITS)
-   {
-       b = (uc >> shift) & BYTMASK;
-       lastt = t;
-       t = lastt->tptr[b];
-       assert(t != NULL);
-       fillt = &cm->tree[level + 1];
-       if (t == fillt && shift > BYTBITS)
-       {                       /* need new ptr block */
-           t = (union tree *) MALLOC(sizeof(struct ptrs));
-           if (t == NULL)
+   /* ordinary characters */
+   for (p = cv->chrs, i = cv->nchrs; i > 0; p++, i--)
+   {
+       ch = *p;
+       subcoloronechr(v, ch, lp, rp, &lastsubcolor);
+       NOERR();
+   }
+
+   /* and the ranges */
+   for (p = cv->ranges, i = cv->nranges; i > 0; p += 2, i--)
+   {
+       from = *p;
+       to = *(p + 1);
+       if (from <= MAX_SIMPLE_CHR)
+       {
+           /* deal with simple chars one at a time */
+           chr         lim = (to <= MAX_SIMPLE_CHR) ? to : MAX_SIMPLE_CHR;
+
+           while (from <= lim)
             {
-               CERR(REG_ESPACE);
-               return;
+               color       sco = subcolor(cm, from);
+
+               NOERR();
+               if (sco != lastsubcolor)
+               {
+                   newarc(v->nfa, PLAIN, sco, lp, rp);
+                   NOERR();
+                   lastsubcolor = sco;
+               }
+               from++;
             }
-           memcpy(VS(t->tptr), VS(fillt->tptr),
-                  BYTTAB * sizeof(union tree *));
-           lastt->tptr[b] = t;
         }
+       /* deal with any part of the range that's above MAX_SIMPLE_CHR */
+       if (from < to)
+           subcoloronerange(v, from, to, lp, rp, &lastsubcolor);
+       else if (from == to)
+           subcoloronechr(v, from, lp, rp, &lastsubcolor);
+       NOERR();
     }
  
-   /* special cases:  fill block or solid block */
-   co = t->tcolor[0];
-   cb = cm->cd[co].block;
-   if (t == fillt || t == cb)
+   /* and deal with cclass if any */
+   if (cv->cclasscode >= 0)
     {
-       /* either way, we want a subcolor solid block */
-       sco = newsub(cm, co);
-       t = cm->cd[sco].block;
-       if (t == NULL)
-       {                       /* must set it up */
-           t = (union tree *) MALLOC(sizeof(struct colors));
-           if (t == NULL)
+       int         classbit;
+       color      *pco;
+       int         r,
+                   c;
+
+       /* Enlarge array if we don't have a column bit assignment for cclass */
+       if (cm->classbits[cv->cclasscode] == 0)
+       {
+           cm->classbits[cv->cclasscode] = cm->hiarraycols;
+           newhicolorcols(cm);
+           NOERR();
+       }
+       /* Apply subcolorhi() and make arc for each entry in relevant cols */
+       classbit = cm->classbits[cv->cclasscode];
+       pco = cm->hicolormap;
+       for (r = 0; r < cm->hiarrayrows; r++)
+       {
+           for (c = 0; c < cm->hiarraycols; c++)
             {
-               CERR(REG_ESPACE);
-               return;
+               if (c & classbit)
+               {
+                   color       sco = subcolorhi(cm, pco);
+
+                   NOERR();
+                   /* add the arc if needed */
+                   if (sco != lastsubcolor)
+                   {
+                       newarc(v->nfa, PLAIN, sco, lp, rp);
+                       NOERR();
+                       lastsubcolor = sco;
+                   }
+               }
+               pco++;
             }
-           for (i = 0; i < BYTTAB; i++)
-               t->tcolor[i] = sco;
-           cm->cd[sco].block = t;
         }
-       /* find loop must have run at least once */
-       lastt->tptr[b] = t;
-       newarc(v->nfa, PLAIN, sco, lp, rp);
-       cm->cd[co].nchrs -= BYTTAB;
-       cm->cd[sco].nchrs += BYTTAB;
+   }
+}
+
+/*
+ * subcoloronechr - do subcolorcvec's work for a singleton chr
+ *
+ * We could just let subcoloronerange do this, but it's a bit more efficient
+ * if we exploit the single-chr case.  Also, callers find it useful for this
+ * to be able to handle both low and high chr codes.
+ */
+static void
+subcoloronechr(struct vars * v,
+              chr ch,
+              struct state * lp,
+              struct state * rp,
+              color *lastsubcolor)
+{
+   struct colormap *cm = v->cm;
+   colormaprange *newranges;
+   int         numnewranges;
+   colormaprange *oldrange;
+   int         oldrangen;
+   int         newrow;
+
+   /* Easy case for low chr codes */
+   if (ch <= MAX_SIMPLE_CHR)
+   {
+       color       sco = subcolor(cm, ch);
+
+       NOERR();
+       if (sco != *lastsubcolor)
+       {
+           newarc(v->nfa, PLAIN, sco, lp, rp);
+           *lastsubcolor = sco;
+       }
+       return;
+   }
+
+   /*
+    * Potentially, we could need two more colormapranges than we have now, if
+    * the given chr is in the middle of some existing range.
+    */
+   newranges = (colormaprange *)
+       MALLOC((cm->numcmranges + 2) * sizeof(colormaprange));
+   if (newranges == NULL)
+   {
+       CERR(REG_ESPACE);
         return;
     }
+   numnewranges = 0;
  
-   /* general case, a mixed block to be altered */
-   i = 0;
-   while (i < BYTTAB)
+   /* Ranges before target are unchanged */
+   for (oldrange = cm->cmranges, oldrangen = 0;
+        oldrangen < cm->numcmranges;
+        oldrange++, oldrangen++)
+   {
+       if (oldrange->cmax >= ch)
+           break;
+       newranges[numnewranges++] = *oldrange;
+   }
+
+   /* Match target chr against current range */
+   if (oldrangen >= cm->numcmranges || oldrange->cmin > ch)
+   {
+       /* chr does not belong to any existing range, make a new one */
+       newranges[numnewranges].cmin = ch;
+       newranges[numnewranges].cmax = ch;
+       /* row state should be cloned from the "all others" row */
+       newranges[numnewranges].rownum = newrow = newhicolorrow(cm, 0);
+       numnewranges++;
+   }
+   else if (oldrange->cmin == oldrange->cmax)
+   {
+       /* we have an existing singleton range matching the chr */
+       newranges[numnewranges++] = *oldrange;
+       newrow = oldrange->rownum;
+       /* we've now fully processed this old range */
+       oldrange++, oldrangen++;
+   }
+   else
     {
-       co = t->tcolor[i];
-       sco = newsub(cm, co);
-       newarc(v->nfa, PLAIN, sco, lp, rp);
-       previ = i;
-       do
+       /* chr is a subset of this existing range, must split it */
+       if (ch > oldrange->cmin)
+       {
+           /* emit portion of old range before chr */
+           newranges[numnewranges].cmin = oldrange->cmin;
+           newranges[numnewranges].cmax = ch - 1;
+           newranges[numnewranges].rownum = oldrange->rownum;
+           numnewranges++;
+       }
+       /* emit chr as singleton range, initially cloning from range */
+       newranges[numnewranges].cmin = ch;
+       newranges[numnewranges].cmax = ch;
+       newranges[numnewranges].rownum = newrow =
+           newhicolorrow(cm, oldrange->rownum);
+       numnewranges++;
+       if (ch < oldrange->cmax)
         {
-           t->tcolor[i++] = sco;
-       } while (i < BYTTAB && t->tcolor[i] == co);
-       ndone = i - previ;
-       cm->cd[co].nchrs -= ndone;
-       cm->cd[sco].nchrs += ndone;
+           /* emit portion of old range after chr */
+           newranges[numnewranges].cmin = ch + 1;
+           newranges[numnewranges].cmax = oldrange->cmax;
+           /* must clone the row if we are making two new ranges from old */
+           newranges[numnewranges].rownum =
+               (ch > oldrange->cmin) ? newhicolorrow(cm, oldrange->rownum) :
+               oldrange->rownum;
+           numnewranges++;
+       }
+       /* we've now fully processed this old range */
+       oldrange++, oldrangen++;
+   }
+
+   /* Update colors in newrow and create arcs as needed */
+   subcoloronerow(v, newrow, lp, rp, lastsubcolor);
+
+   /* Ranges after target are unchanged */
+   for (; oldrangen < cm->numcmranges; oldrange++, oldrangen++)
+   {
+       newranges[numnewranges++] = *oldrange;
+   }
+
+   /* Assert our original space estimate was adequate */
+   assert(numnewranges <= (cm->numcmranges + 2));
+
+   /* And finally, store back the updated list of ranges */
+   if (cm->cmranges != NULL)
+       FREE(cm->cmranges);
+   cm->cmranges = newranges;
+   cm->numcmranges = numnewranges;
+}
+
+/*
+ * subcoloronerange - do subcolorcvec's work for a high range
+ */
+static void
+subcoloronerange(struct vars * v,
+                chr from,
+                chr to,
+                struct state * lp,
+                struct state * rp,
+                color *lastsubcolor)
+{
+   struct colormap *cm = v->cm;
+   colormaprange *newranges;
+   int         numnewranges;
+   colormaprange *oldrange;
+   int         oldrangen;
+   int         newrow;
+
+   /* Caller should take care of non-high-range cases */
+   assert(from > MAX_SIMPLE_CHR);
+   assert(from < to);
+
+   /*
+    * Potentially, if we have N non-adjacent ranges, we could need as many as
+    * 2N+1 result ranges (consider case where new range spans 'em all).
+    */
+   newranges = (colormaprange *)
+       MALLOC((cm->numcmranges * 2 + 1) * sizeof(colormaprange));
+   if (newranges == NULL)
+   {
+       CERR(REG_ESPACE);
+       return;
+   }
+   numnewranges = 0;
+
+   /* Ranges before target are unchanged */
+   for (oldrange = cm->cmranges, oldrangen = 0;
+        oldrangen < cm->numcmranges;
+        oldrange++, oldrangen++)
+   {
+       if (oldrange->cmax >= from)
+           break;
+       newranges[numnewranges++] = *oldrange;
+   }
+
+   /*
+    * Deal with ranges that (partially) overlap the target.  As we process
+    * each such range, increase "from" to remove the dealt-with characters
+    * from the target range.
+    */
+   while (oldrangen < cm->numcmranges && oldrange->cmin <= to)
+   {
+       if (from < oldrange->cmin)
+       {
+           /* Handle portion of new range that corresponds to no old range */
+           newranges[numnewranges].cmin = from;
+           newranges[numnewranges].cmax = oldrange->cmin - 1;
+           /* row state should be cloned from the "all others" row */
+           newranges[numnewranges].rownum = newrow = newhicolorrow(cm, 0);
+           numnewranges++;
+           /* Update colors in newrow and create arcs as needed */
+           subcoloronerow(v, newrow, lp, rp, lastsubcolor);
+           /* We've now fully processed the part of new range before old */
+           from = oldrange->cmin;
+       }
+
+       if (from <= oldrange->cmin && to >= oldrange->cmax)
+       {
+           /* old range is fully contained in new, process it in-place */
+           newranges[numnewranges++] = *oldrange;
+           newrow = oldrange->rownum;
+           from = oldrange->cmax + 1;
+       }
+       else
+       {
+           /* some part of old range does not overlap new range */
+           if (from > oldrange->cmin)
+           {
+               /* emit portion of old range before new range */
+               newranges[numnewranges].cmin = oldrange->cmin;
+               newranges[numnewranges].cmax = from - 1;
+               newranges[numnewranges].rownum = oldrange->rownum;
+               numnewranges++;
+           }
+           /* emit common subrange, initially cloning from old range */
+           newranges[numnewranges].cmin = from;
+           newranges[numnewranges].cmax =
+               (to < oldrange->cmax) ? to : oldrange->cmax;
+           newranges[numnewranges].rownum = newrow =
+               newhicolorrow(cm, oldrange->rownum);
+           numnewranges++;
+           if (to < oldrange->cmax)
+           {
+               /* emit portion of old range after new range */
+               newranges[numnewranges].cmin = to + 1;
+               newranges[numnewranges].cmax = oldrange->cmax;
+               /* must clone the row if we are making two new ranges from old */
+               newranges[numnewranges].rownum =
+                   (from > oldrange->cmin) ? newhicolorrow(cm, oldrange->rownum) :
+                   oldrange->rownum;
+               numnewranges++;
+           }
+           from = oldrange->cmax + 1;
+       }
+       /* Update colors in newrow and create arcs as needed */
+       subcoloronerow(v, newrow, lp, rp, lastsubcolor);
+       /* we've now fully processed this old range */
+       oldrange++, oldrangen++;
+   }
+
+   if (from <= to)
+   {
+       /* Handle portion of new range that corresponds to no old range */
+       newranges[numnewranges].cmin = from;
+       newranges[numnewranges].cmax = to;
+       /* row state should be cloned from the "all others" row */
+       newranges[numnewranges].rownum = newrow = newhicolorrow(cm, 0);
+       numnewranges++;
+       /* Update colors in newrow and create arcs as needed */
+       subcoloronerow(v, newrow, lp, rp, lastsubcolor);
+   }
+
+   /* Ranges after target are unchanged */
+   for (; oldrangen < cm->numcmranges; oldrange++, oldrangen++)
+   {
+       newranges[numnewranges++] = *oldrange;
+   }
+
+   /* Assert our original space estimate was adequate */
+   assert(numnewranges <= (cm->numcmranges * 2 + 1));
+
+   /* And finally, store back the updated list of ranges */
+   if (cm->cmranges != NULL)
+       FREE(cm->cmranges);
+   cm->cmranges = newranges;
+   cm->numcmranges = numnewranges;
+}
+
+/*
+ * subcoloronerow - do subcolorcvec's work for one new row in the high colormap
+ */
+static void
+subcoloronerow(struct vars * v,
+              int rownum,
+              struct state * lp,
+              struct state * rp,
+              color *lastsubcolor)
+{
+   struct colormap *cm = v->cm;
+   color      *pco;
+   int         i;
+
+   /* Apply subcolorhi() and make arc for each entry in row */
+   pco = &cm->hicolormap[rownum * cm->hiarraycols];
+   for (i = 0; i < cm->hiarraycols; pco++, i++)
+   {
+       color       sco = subcolorhi(cm, pco);
+
+       NOERR();
+       /* make the arc if needed */
+       if (sco != *lastsubcolor)
+       {
+           newarc(v->nfa, PLAIN, sco, lp, rp);
+           NOERR();
+           *lastsubcolor = sco;
+       }
     }
  }
  
@@ -575,12 +934,12 @@ okcolors(struct nfa * nfa,
         {
             /* is subcolor, let parent deal with it */
         }
-       else if (cd->nchrs == 0)
+       else if (cd->nschrs == 0 && cd->nuchrs == 0)
         {
             /* parent empty, its arcs change color to subcolor */
             cd->sub = NOSUB;
             scd = &cm->cd[sco];
-           assert(scd->nchrs > 0);
+           assert(scd->nschrs > 0 || scd->nuchrs > 0);
             assert(scd->sub == sco);
             scd->sub = NOSUB;
             while ((a = cd->arcs) != NULL)
@@ -597,7 +956,7 @@ okcolors(struct nfa * nfa,
             /* parent's arcs must gain parallel subcolor arcs */
             cd->sub = NOSUB;
             scd = &cm->cd[sco];
-           assert(scd->nchrs > 0);
+           assert(scd->nschrs > 0 || scd->nuchrs > 0);
             assert(scd->sub == sco);
             scd->sub = NOSUB;
             for (a = cd->arcs; a != NULL; a = a->colorchain)
@@ -711,62 +1070,54 @@ dumpcolors(struct colormap * cm,
     struct colordesc *end;
     color       co;
     chr         c;
-   char       *has;
  
     fprintf(f, "max %ld\n", (long) cm->max);
-   if (NBYTS > 1)
-       fillcheck(cm, cm->tree, 0, f);
     end = CDEND(cm);
     for (cd = cm->cd + 1, co = 1; cd < end; cd++, co++) /* skip 0 */
+   {
         if (!UNUSEDCOLOR(cd))
         {
-           assert(cd->nchrs > 0);
-           has = (cd->block != NULL) ? "#" : "";
+           assert(cd->nschrs > 0 || cd->nuchrs > 0);
             if (cd->flags & PSEUDO)
-               fprintf(f, "#%2ld%s(ps): ", (long) co, has);
+               fprintf(f, "#%2ld(ps): ", (long) co);
             else
-               fprintf(f, "#%2ld%s(%2d): ", (long) co,
-                       has, cd->nchrs);
+               fprintf(f, "#%2ld(%2d): ", (long) co, cd->nschrs + cd->nuchrs);
  
             /*
              * Unfortunately, it's hard to do this next bit more efficiently.
-            *
-            * Spencer's original coding has the loop iterating from CHR_MIN
-            * to CHR_MAX, but that's utterly unusable for 32-bit chr. For
-            * debugging purposes it seems fine to print only chr codes up to
-            * 1000 or so.
              */
-           for (c = CHR_MIN; c < 1000; c++)
+           for (c = CHR_MIN; c <= MAX_SIMPLE_CHR; c++)
                 if (GETCOLOR(cm, c) == co)
                     dumpchr(c, f);
             fprintf(f, "\n");
         }
-}
-
-/*
- * fillcheck - check proper filling of a tree
- */
-static void
-fillcheck(struct colormap * cm,
-         union tree * tree,
-         int level,            /* level number (top == 0) of this block */
-         FILE *f)
-{
-   int         i;
-   union tree *t;
-   union tree *fillt = &cm->tree[level + 1];
-
-   assert(level < NBYTS - 1);  /* this level has pointers */
-   for (i = BYTTAB - 1; i >= 0; i--)
+   }
+   /* dump the high colormap if it contains anything interesting */
+   if (cm->hiarrayrows > 1 || cm->hiarraycols > 1)
     {
-       t = tree->tptr[i];
-       if (t == NULL)
-           fprintf(f, "NULL found in filled tree!\n");
-       else if (t == fillt)
+       int         r,
+                   c;
+       const color *rowptr;
+
+       fprintf(f, "other:\t");
+       for (c = 0; c < cm->hiarraycols; c++)
+       {
+           fprintf(f, "\t%ld", (long) cm->hicolormap[c]);
+       }
+       fprintf(f, "\n");
+       for (r = 0; r < cm->numcmranges; r++)
         {
+           dumpchr(cm->cmranges[r].cmin, f);
+           fprintf(f, "..");
+           dumpchr(cm->cmranges[r].cmax, f);
+           fprintf(f, ":");
+           rowptr = &cm->hicolormap[cm->cmranges[r].rownum * cm->hiarraycols];
+           for (c = 0; c < cm->hiarraycols; c++)
+           {
+               fprintf(f, "\t%ld", (long) rowptr[c]);
+           }
+           fprintf(f, "\n");
         }
-       else if (level < NBYTS - 2)     /* more pointer blocks below */
-           fillcheck(cm, t, level + 1, f);
     }
  }
  
diff --git a/src/backend/regex/regc_cvec.c b/src/backend/regex/regc_cvec.c

index 3a9e8cfbbd7390b362cd3480dc1a6782da7a6a95..50b7a4574b4049ddfd2497d16c2bbf99ce0f1c64 100644 (file)
--- a/src/backend/regex/regc_cvec.c
+++ b/src/backend/regex/regc_cvec.c
@@ -34,7 +34,8 @@
  
  /*
   * Notes:
- * Only (selected) functions in _this_ file should treat chr* as non-constant.
+ * Only (selected) functions in _this_ file should treat the chr arrays
+ * of a cvec as non-constant.
   */
  
  /*
@@ -67,6 +68,7 @@ clearcvec(struct cvec * cv)
     assert(cv != NULL);
     cv->nchrs = 0;
     cv->nranges = 0;
+   cv->cclasscode = -1;
     return cv;
  }
  
diff --git a/src/backend/regex/regc_locale.c b/src/backend/regex/regc_locale.c

index 399de027cdd655ec3ae0d5ebd28450421ad6b9aa..7cb3a40a0c88b678577ccb20b847641233037226 100644 (file)
--- a/src/backend/regex/regc_locale.c
+++ b/src/backend/regex/regc_locale.c
@@ -349,6 +349,19 @@ static const struct cname
     }
  };
  
+/*
+ * The following arrays define the valid character class names.
+ */
+static const char *const classNames[NUM_CCLASSES + 1] = {
+   "alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph",
+   "lower", "print", "punct", "space", "upper", "xdigit", NULL
+};
+
+enum classes
+{
+   CC_ALNUM, CC_ALPHA, CC_ASCII, CC_BLANK, CC_CNTRL, CC_DIGIT, CC_GRAPH,
+   CC_LOWER, CC_PRINT, CC_PUNCT, CC_SPACE, CC_UPPER, CC_XDIGIT
+};
  
  /*
   * We do not use the hard-wired Unicode classification tables that Tcl does.
@@ -543,21 +556,6 @@ cclass(struct vars * v,            /* context */
     int         i,
                 index;
  
-   /*
-    * The following arrays define the valid character class names.
-    */
-
-   static const char *const classNames[] = {
-       "alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph",
-       "lower", "print", "punct", "space", "upper", "xdigit", NULL
-   };
-
-   enum classes
-   {
-       CC_ALNUM, CC_ALPHA, CC_ASCII, CC_BLANK, CC_CNTRL, CC_DIGIT, CC_GRAPH,
-       CC_LOWER, CC_PRINT, CC_PUNCT, CC_SPACE, CC_UPPER, CC_XDIGIT
-   };
-
     /*
      * Map the name to the corresponding enumerated value.
      */
@@ -593,18 +591,20 @@ cclass(struct vars * v,           /* context */
      * pg_ctype_get_cache so that we can cache the results.  Other classes
      * have definitions that are hard-wired here, and for those we just
      * construct a transient cvec on the fly.
+    *
+    * NB: keep this code in sync with cclass_column_index(), below.
      */
  
     switch ((enum classes) index)
     {
         case CC_PRINT:
-           cv = pg_ctype_get_cache(pg_wc_isprint);
+           cv = pg_ctype_get_cache(pg_wc_isprint, index);
             break;
         case CC_ALNUM:
-           cv = pg_ctype_get_cache(pg_wc_isalnum);
+           cv = pg_ctype_get_cache(pg_wc_isalnum, index);
             break;
         case CC_ALPHA:
-           cv = pg_ctype_get_cache(pg_wc_isalpha);
+           cv = pg_ctype_get_cache(pg_wc_isalpha, index);
             break;
         case CC_ASCII:
             /* hard-wired meaning */
@@ -625,10 +625,10 @@ cclass(struct vars * v,           /* context */
             addrange(cv, 0x7f, 0x9f);
             break;
         case CC_DIGIT:
-           cv = pg_ctype_get_cache(pg_wc_isdigit);
+           cv = pg_ctype_get_cache(pg_wc_isdigit, index);
             break;
         case CC_PUNCT:
-           cv = pg_ctype_get_cache(pg_wc_ispunct);
+           cv = pg_ctype_get_cache(pg_wc_ispunct, index);
             break;
         case CC_XDIGIT:
  
@@ -646,16 +646,16 @@ cclass(struct vars * v,           /* context */
             }
             break;
         case CC_SPACE:
-           cv = pg_ctype_get_cache(pg_wc_isspace);
+           cv = pg_ctype_get_cache(pg_wc_isspace, index);
             break;
         case CC_LOWER:
-           cv = pg_ctype_get_cache(pg_wc_islower);
+           cv = pg_ctype_get_cache(pg_wc_islower, index);
             break;
         case CC_UPPER:
-           cv = pg_ctype_get_cache(pg_wc_isupper);
+           cv = pg_ctype_get_cache(pg_wc_isupper, index);
             break;
         case CC_GRAPH:
-           cv = pg_ctype_get_cache(pg_wc_isgraph);
+           cv = pg_ctype_get_cache(pg_wc_isgraph, index);
             break;
     }
  
@@ -665,6 +665,47 @@ cclass(struct vars * v,            /* context */
     return cv;
  }
  
+/*
+ * cclass_column_index - get appropriate high colormap column index for chr
+ */
+static int
+cclass_column_index(struct colormap * cm, chr c)
+{
+   int         colnum = 0;
+
+   /* Shouldn't go through all these pushups for simple chrs */
+   assert(c > MAX_SIMPLE_CHR);
+
+   /*
+    * Note: we should not see requests to consider cclasses that are not
+    * treated as locale-specific by cclass(), above.
+    */
+   if (cm->classbits[CC_PRINT] && pg_wc_isprint(c))
+       colnum |= cm->classbits[CC_PRINT];
+   if (cm->classbits[CC_ALNUM] && pg_wc_isalnum(c))
+       colnum |= cm->classbits[CC_ALNUM];
+   if (cm->classbits[CC_ALPHA] && pg_wc_isalpha(c))
+       colnum |= cm->classbits[CC_ALPHA];
+   assert(cm->classbits[CC_ASCII] == 0);
+   assert(cm->classbits[CC_BLANK] == 0);
+   assert(cm->classbits[CC_CNTRL] == 0);
+   if (cm->classbits[CC_DIGIT] && pg_wc_isdigit(c))
+       colnum |= cm->classbits[CC_DIGIT];
+   if (cm->classbits[CC_PUNCT] && pg_wc_ispunct(c))
+       colnum |= cm->classbits[CC_PUNCT];
+   assert(cm->classbits[CC_XDIGIT] == 0);
+   if (cm->classbits[CC_SPACE] && pg_wc_isspace(c))
+       colnum |= cm->classbits[CC_SPACE];
+   if (cm->classbits[CC_LOWER] && pg_wc_islower(c))
+       colnum |= cm->classbits[CC_LOWER];
+   if (cm->classbits[CC_UPPER] && pg_wc_isupper(c))
+       colnum |= cm->classbits[CC_UPPER];
+   if (cm->classbits[CC_GRAPH] && pg_wc_isgraph(c))
+       colnum |= cm->classbits[CC_GRAPH];
+
+   return colnum;
+}
+
  /*
   * allcases - supply cvec for all case counterparts of a chr (including itself)
   *
diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c

index 551ae7dc08eed55e841a4a89ca60d2d01a34374d..ad9d4b1961d4398f135e11a777160d7754fbb59e 100644 (file)
--- a/src/backend/regex/regc_pg_locale.c
+++ b/src/backend/regex/regc_pg_locale.c
@@ -736,7 +736,7 @@ store_match(pg_ctype_cache *pcc, pg_wchar chr1, int nchrs)
   * Note that the result must not be freed or modified by caller.
   */
  static struct cvec *
-pg_ctype_get_cache(pg_wc_probefunc probefunc)
+pg_ctype_get_cache(pg_wc_probefunc probefunc, int cclasscode)
  {
     pg_ctype_cache *pcc;
     pg_wchar    max_chr;
@@ -770,31 +770,43 @@ pg_ctype_get_cache(pg_wc_probefunc probefunc)
     pcc->cv.ranges = (chr *) malloc(pcc->cv.rangespace * sizeof(chr) * 2);
     if (pcc->cv.chrs == NULL || pcc->cv.ranges == NULL)
         goto out_of_memory;
+   pcc->cv.cclasscode = cclasscode;
  
     /*
-    * Decide how many character codes we ought to look through.  For C locale
-    * there's no need to go further than 127.  Otherwise, if the encoding is
-    * UTF8 go up to 0x7FF, which is a pretty arbitrary cutoff but we cannot
-    * extend it as far as we'd like (say, 0xFFFF, the end of the Basic
-    * Multilingual Plane) without creating significant performance issues due
-    * to too many characters being fed through the colormap code.  This will
-    * need redesign to fix reasonably, but at least for the moment we have
-    * all common European languages covered.  Otherwise (not C, not UTF8) go
-    * up to 255.  These limits are interrelated with restrictions discussed
-    * at the head of this file.
+    * Decide how many character codes we ought to look through.  In general
+    * we don't go past MAX_SIMPLE_CHR; chr codes above that are handled at
+    * runtime using the "high colormap" mechanism.  However, in C locale
+    * there's no need to go further than 127, and if we only have a 1-byte
+    *  API there's no need to go further than that can handle.
+    *
+    * If it's not MAX_SIMPLE_CHR that's constraining the search, mark the
+    * output cvec as not having any locale-dependent behavior, since there
+    * will be no need to do any run-time locale checks.  (The #if's here
+    * would always be true for production values of MAX_SIMPLE_CHR, but it's
+    * useful to allow it to be small for testing purposes.)
      */
     switch (pg_regex_strategy)
     {
         case PG_REGEX_LOCALE_C:
+#if MAX_SIMPLE_CHR >= 127
             max_chr = (pg_wchar) 127;
+           pcc->cv.cclasscode = -1;
+#else
+           max_chr = (pg_wchar) MAX_SIMPLE_CHR;
+#endif
             break;
         case PG_REGEX_LOCALE_WIDE:
         case PG_REGEX_LOCALE_WIDE_L:
-           max_chr = (pg_wchar) 0x7FF;
+           max_chr = (pg_wchar) MAX_SIMPLE_CHR;
             break;
         case PG_REGEX_LOCALE_1BYTE:
         case PG_REGEX_LOCALE_1BYTE_L:
+#if MAX_SIMPLE_CHR >= UCHAR_MAX
             max_chr = (pg_wchar) UCHAR_MAX;
+           pcc->cv.cclasscode = -1;
+#else
+           max_chr = (pg_wchar) MAX_SIMPLE_CHR;
+#endif
             break;
         default:
             max_chr = 0;        /* can't get here, but keep compiler quiet */
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c

index b211cc0a189f7d6d06f5986ba85899219d0a01d0..ed95474b3f073fc10670bbc1646697714f91cfb8 100644 (file)
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -55,7 +55,6 @@ static void cbracket(struct vars *, struct state *, struct state *);
  static void brackpart(struct vars *, struct state *, struct state *);
  static const chr *scanplain(struct vars *);
  static void onechr(struct vars *, chr, struct state *, struct state *);
-static void dovec(struct vars *, struct cvec *, struct state *, struct state *);
  static void wordchrs(struct vars *);
  static void processlacon(struct vars *, struct state *, struct state *, int,
              struct state *, struct state *);
@@ -96,16 +95,19 @@ static chr  chrnamed(struct vars *, const chr *, const chr *, chr);
  /* === regc_color.c === */
  static void initcm(struct vars *, struct colormap *);
  static void freecm(struct colormap *);
-static void cmtreefree(struct colormap *, union tree *, int);
-static color setcolor(struct colormap *, chr, color);
  static color maxcolor(struct colormap *);
  static color newcolor(struct colormap *);
  static void freecolor(struct colormap *, color);
  static color pseudocolor(struct colormap *);
-static color subcolor(struct colormap *, chr c);
+static color subcolor(struct colormap *, chr);
+static color subcolorhi(struct colormap *, color *);
  static color newsub(struct colormap *, color);
-static void subrange(struct vars *, chr, chr, struct state *, struct state *);
-static void subblock(struct vars *, chr, struct state *, struct state *);
+static int newhicolorrow(struct colormap *, int);
+static void newhicolorcols(struct colormap *);
+static void subcolorcvec(struct vars *, struct cvec *, struct state *, struct state *);
+static void subcoloronechr(struct vars *, chr, struct state *, struct state *, color *);
+static void subcoloronerange(struct vars *, chr, chr, struct state *, struct state *, color *);
+static void subcoloronerow(struct vars *, int, struct state *, struct state *, color *);
  static void okcolors(struct nfa *, struct colormap *);
  static void colorchain(struct colormap *, struct arc *);
  static void uncolorchain(struct colormap *, struct arc *);
@@ -114,7 +116,6 @@ static void colorcomplement(struct nfa *, struct colormap *, int, struct state *
  
  #ifdef REG_DEBUG
  static void dumpcolors(struct colormap *, FILE *);
-static void fillcheck(struct colormap *, union tree *, int, FILE *);
  static void dumpchr(chr, FILE *);
  #endif
  /* === regc_nfa.c === */
@@ -215,6 +216,7 @@ static struct cvec *range(struct vars *, chr, chr, int);
  static int before(chr, chr);
  static struct cvec *eclass(struct vars *, chr, int);
  static struct cvec *cclass(struct vars *, const chr *, const chr *, int);
+static int cclass_column_index(struct colormap *, chr);
  static struct cvec *allcases(struct vars *, chr);
  static int cmp(const chr *, const chr *, size_t);
  static int casecmp(const chr *, const chr *, size_t);
@@ -1467,7 +1469,7 @@ brackpart(struct vars * v,
             NOERR();
             cv = eclass(v, startc, (v->cflags & REG_ICASE));
             NOERR();
-           dovec(v, cv, lp, rp);
+           subcolorcvec(v, cv, lp, rp);
             return;
             break;
         case CCLASS:
@@ -1477,7 +1479,7 @@ brackpart(struct vars * v,
             NOERR();
             cv = cclass(v, startp, endp, (v->cflags & REG_ICASE));
             NOERR();
-           dovec(v, cv, lp, rp);
+           subcolorcvec(v, cv, lp, rp);
             return;
             break;
         default:
@@ -1523,7 +1525,7 @@ brackpart(struct vars * v,
         NOTE(REG_UUNPORT);
     cv = range(v, startc, endc, (v->cflags & REG_ICASE));
     NOERR();
-   dovec(v, cv, lp, rp);
+   subcolorcvec(v, cv, lp, rp);
  }
  
  /*
@@ -1565,46 +1567,14 @@ onechr(struct vars * v,
  {
     if (!(v->cflags & REG_ICASE))
     {
-       newarc(v->nfa, PLAIN, subcolor(v->cm, c), lp, rp);
+       color       lastsubcolor = COLORLESS;
+
+       subcoloronechr(v, c, lp, rp, &lastsubcolor);
         return;
     }
  
     /* rats, need general case anyway... */
-   dovec(v, allcases(v, c), lp, rp);
-}
-
-/*
- * dovec - fill in arcs for each element of a cvec
- */
-static void
-dovec(struct vars * v,
-     struct cvec * cv,
-     struct state * lp,
-     struct state * rp)
-{
-   chr         ch,
-               from,
-               to;
-   const chr  *p;
-   int         i;
-
-   /* ordinary characters */
-   for (p = cv->chrs, i = cv->nchrs; i > 0; p++, i--)
-   {
-       ch = *p;
-       newarc(v->nfa, PLAIN, subcolor(v->cm, ch), lp, rp);
-       NOERR();
-   }
-
-   /* and the ranges */
-   for (p = cv->ranges, i = cv->nranges; i > 0; p += 2, i--)
-   {
-       from = *p;
-       to = *(p + 1);
-       if (from <= to)
-           subrange(v, from, to, lp, rp);
-       NOERR();
-   }
+   subcolorcvec(v, allcases(v, c), lp, rp);
  }
  
  /*
diff --git a/src/backend/regex/regexport.c b/src/backend/regex/regexport.c

index 93da82286f3d556f02e29031a2942b8211ac2531..a515949bb4d738c5aa6598ce443ab4bee7365d51 100644 (file)
--- a/src/backend/regex/regexport.c
+++ b/src/backend/regex/regexport.c
@@ -28,10 +28,6 @@
  
  #include "regex/regexport.h"
  
-static void scancolormap(struct colormap * cm, int co,
-            union tree * t, int level, chr partial,
-            pg_wchar **chars, int *chars_len);
-
  
  /*
   * Get total number of NFA states.
@@ -187,10 +183,7 @@ pg_reg_colorisend(const regex_t *regex, int co)
   *
   * Note: we return -1 if the color number is invalid, or if it is a special
   * color (WHITE or a pseudocolor), or if the number of members is uncertain.
- * The latter case cannot arise right now but is specified to allow for future
- * improvements (see musings about run-time handling of higher character codes
- * in regex/README).  Callers should not try to extract the members if -1 is
- * returned.
+ * Callers should not try to extract the members if -1 is returned.
   */
  int
  pg_reg_getnumcharacters(const regex_t *regex, int co)
@@ -205,7 +198,18 @@ pg_reg_getnumcharacters(const regex_t *regex, int co)
     if (cm->cd[co].flags & PSEUDO)      /* also pseudocolors (BOS etc) */
         return -1;
  
-   return cm->cd[co].nchrs;
+   /*
+    * If the color appears anywhere in the high colormap, treat its number of
+    * members as uncertain.  In principle we could determine all the specific
+    * chrs corresponding to each such entry, but it would be expensive
+    * (particularly if character class tests are required) and it doesn't
+    * seem worth it.
+    */
+   if (cm->cd[co].nuchrs != 0)
+       return -1;
+
+   /* OK, return the known number of member chrs */
+   return cm->cd[co].nschrs;
  }
  
  /*
@@ -222,6 +226,7 @@ pg_reg_getcharacters(const regex_t *regex, int co,
                      pg_wchar *chars, int chars_len)
  {
     struct colormap *cm;
+   chr         c;
  
     assert(regex != NULL && regex->re_magic == REMAGIC);
     cm = &((struct guts *) regex->re_guts)->cmap;
@@ -231,62 +236,17 @@ pg_reg_getcharacters(const regex_t *regex, int co,
     if (cm->cd[co].flags & PSEUDO)
         return;
  
-   /* Recursively search the colormap tree */
-   scancolormap(cm, co, cm->tree, 0, 0, &chars, &chars_len);
-}
-
-/*
- * Recursively scan the colormap tree to find chrs belonging to color "co".
- * See regex/README for info about the tree structure.
- *
- * t: tree block to scan
- * level: level (from 0) of t
- * partial: partial chr code for chrs within t
- * chars, chars_len: output area
- */
-static void
-scancolormap(struct colormap * cm, int co,
-            union tree * t, int level, chr partial,
-            pg_wchar **chars, int *chars_len)
-{
-   int         i;
-
-   if (level < NBYTS - 1)
-   {
-       /* non-leaf node */
-       for (i = 0; i < BYTTAB; i++)
-       {
-           /*
-            * We do not support search for chrs of color 0 (WHITE), so
-            * all-white subtrees need not be searched.  These can be
-            * recognized because they are represented by the fill blocks in
-            * the colormap struct.  This typically allows us to avoid
-            * scanning large regions of higher-numbered chrs.
-            */
-           if (t->tptr[i] == &cm->tree[level + 1])
-               continue;
-
-           /* Recursively scan next level down */
-           scancolormap(cm, co,
-                        t->tptr[i], level + 1,
-                        (partial | (chr) i) << BYTBITS,
-                        chars, chars_len);
-       }
-   }
-   else
+   /*
+    * We need only examine the low character map; there should not be any
+    * matching entries in the high map.
+    */
+   for (c = CHR_MIN; c <= MAX_SIMPLE_CHR; c++)
     {
-       /* leaf node */
-       for (i = 0; i < BYTTAB; i++)
+       if (cm->locolormap[c - CHR_MIN] == co)
         {
-           if (t->tcolor[i] == co)
-           {
-               if (*chars_len > 0)
-               {
-                   **chars = partial | (chr) i;
-                   (*chars)++;
-                   (*chars_len)--;
-               }
-           }
+           *chars++ = c;
+           if (--chars_len == 0)
+               break;
         }
     }
  }
diff --git a/src/backend/regex/regprefix.c b/src/backend/regex/regprefix.c

index 04b6449a20cfb12079660bbe1782a76454bab0c5..f338715225070c970536668aebd7d92b14122afc 100644 (file)
--- a/src/backend/regex/regprefix.c
+++ b/src/backend/regex/regprefix.c
@@ -194,7 +194,10 @@ findprefix(struct cnfa * cnfa,
         if (thiscolor == COLORLESS)
             break;
         /* The color must be a singleton */
-       if (cm->cd[thiscolor].nchrs != 1)
+       if (cm->cd[thiscolor].nschrs != 1)
+           break;
+       /* Must not have any high-color-map entries */
+       if (cm->cd[thiscolor].nuchrs != 0)
             break;
  
         /*
diff --git a/src/include/regex/regcustom.h b/src/include/regex/regcustom.h

index 459851a7f6bdff8a04b0d3136fc61e4a64c21dc5..04a1893c806ff38556a6f1e490936128ec355fed 100644 (file)
--- a/src/include/regex/regcustom.h
+++ b/src/include/regex/regcustom.h
@@ -77,6 +77,16 @@ typedef unsigned uchr;           /* unsigned type that will hold a chr */
   */
  #define CHR_IS_IN_RANGE(c) ((c) <= CHR_MAX)
  
+/*
+ * MAX_SIMPLE_CHR is the cutoff between "simple" and "complicated" processing
+ * in the color map logic.  It should usually be chosen high enough to ensure
+ * that all common characters are <= MAX_SIMPLE_CHR.  However, very large
+ * values will be counterproductive since they cause more regex setup time.
+ * Also, small values can be helpful for testing the high-color-map logic
+ * with plain old ASCII input.
+ */
+#define MAX_SIMPLE_CHR 0x7FF   /* suitable value for Unicode */
+
  /* functions operating on chr */
  #define iscalnum(x) pg_wc_isalnum(x)
  #define iscalpha(x) pg_wc_isalpha(x)
diff --git a/src/include/regex/regex.h b/src/include/regex/regex.h

index 2f89dc9326bfab16a10bd6ba070a6b1230c3ff0c..cc73db25473c06e058555021e08cd5d1e2c75417 100644 (file)
--- a/src/include/regex/regex.h
+++ b/src/include/regex/regex.h
@@ -172,6 +172,5 @@ extern int  pg_regexec(regex_t *, const pg_wchar *, size_t, size_t, rm_detail_t *
  extern int pg_regprefix(regex_t *, pg_wchar **, size_t *);
  extern void pg_regfree(regex_t *);
  extern size_t pg_regerror(int, const regex_t *, char *, size_t);
-extern void pg_set_regex_collation(Oid collation);
  
  #endif   /* _REGEX_H_ */
diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h

index b0aa641cc4f6d851e19b3635edb9a8d3ddfbeaeb..69816f1449a6cbf038134f9bce94243be5660701 100644 (file)
--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -127,23 +127,6 @@
  #define ISBSET(uv, sn) ((uv)[(sn)/UBITS] & ((unsigned)1 << ((sn)%UBITS)))
  
  
-
-/*
- * We dissect a chr into byts for colormap table indexing.  Here we define
- * a byt, which will be the same as a byte on most machines...  The exact
- * size of a byt is not critical, but about 8 bits is good, and extraction
- * of 8-bit chunks is sometimes especially fast.
- */
-#ifndef BYTBITS
-#define BYTBITS 8              /* bits in a byt */
-#endif
-#define BYTTAB (1<
-#define BYTMASK (BYTTAB-1)     /* bit mask for byt */
-#define NBYTS  ((CHRBITS+BYTBITS-1)/BYTBITS)
-/* the definition of GETCOLOR(), below, assumes NBYTS <= 4 */
-
-
-
 /*
  * As soon as possible, we map chrs into equivalence classes -- "colors" --
  * which are of much more manageable number.
@@ -153,42 +136,9 @@ typedef short color;           /* colors of characters */
 #define MAX_COLOR  32767       /* max color (must fit in 'color' datatype) */
 #define COLORLESS  (-1)        /* impossible color */
 #define WHITE      0           /* default color, parent of all others */
-
+/* Note: various places in the code know that WHITE is zero */
 
 
-/*
- * A colormap is a tree -- more precisely, a DAG -- indexed at each level
- * by a byt of the chr, to map the chr to a color efficiently.  Because
- * lower sections of the tree can be shared, it can exploit the usual
- * sparseness of such a mapping table.  The tree is always NBYTS levels
- * deep (in the past it was shallower during construction but was "filled"
- * to full depth at the end of that); areas that are unaltered as yet point
- * to "fill blocks" which are entirely WHITE in color.
- *
- * Leaf-level tree blocks are of type "struct colors", while upper-level
- * blocks are of type "struct ptrs".  Pointers into the tree are generally
- * declared as "union tree *" to be agnostic about what level they point to.
- */
-
-/* the tree itself */
-struct colors
-{
-   color       ccolor[BYTTAB];
-};
-struct ptrs
-{
-   union tree *pptr[BYTTAB];
-};
-union tree
-{
-   struct colors colors;
-   struct ptrs ptrs;
-};
-
-/* use these pseudo-field names when dereferencing a "union tree" pointer */
-#define tcolor colors.ccolor
-#define tptr   ptrs.pptr
-
 /*
  * Per-color data structure for the compile-time color machinery
  *
@@ -203,26 +153,56 @@ union tree
  */
 struct colordesc
 {
-   uchr        nchrs;          /* number of chars of this color */
+   int         nschrs;         /* number of simple chars of this color */
+   int         nuchrs;         /* number of upper map entries of this color */
    color       sub;            /* open subcolor, if any; or free-chain ptr */
 #define  NOSUB  COLORLESS      /* value of "sub" when no open subcolor */
    struct arc *arcs;           /* chain of all arcs of this color */
-   chr         firstchr;       /* char first assigned to this color */
+   chr         firstchr;       /* simple char first assigned to this color */
    int         flags;          /* bit values defined next */
 #define  FREECOL 01                /* currently free */
 #define  PSEUDO  02                /* pseudocolor, no real chars */
 #define  UNUSEDCOLOR(cd) ((cd)->flags & FREECOL)
-   union tree *block;          /* block of solid color, if any */
 };
 
 /*
  * The color map itself
  *
- * Much of the data in the colormap struct is only used at compile time.
- * However, the bulk of the space usage is in the "tree" structure, so it's
- * not clear that there's much point in converting the rest to a more compact
- * form when compilation is finished.
+ * This struct holds both data used only at compile time, and the chr to
+ * color mapping information, used at both compile and run time.  The latter
+ * is the bulk of the space, so it's not really worth separating out the
+ * compile-only portion.
+ *
+ * Ideally, the mapping data would just be an array of colors indexed by
+ * chr codes; but for large character sets that's impractical.  Fortunately,
+ * common characters have smaller codes, so we can use a simple array for chr
+ * codes up to MAX_SIMPLE_CHR, and do something more complex for codes above
+ * that, without much loss of performance.  The "something more complex" is a
+ * 2-D array of color entries, where row indexes correspond to individual chrs
+ * or chr ranges that have been mentioned in the regex (with row zero
+ * representing all other chrs), and column indexes correspond to different
+ * sets of locale-dependent character classes such as "isalpha".  The
+ * classbits[k] entry is zero if we do not care about the k'th character class
+ * in this regex, and otherwise it is the bit to be OR'd into the column index
+ * if the character in question is a member of that class.  We find the color
+ * of a high-valued chr by identifying which colormaprange it is in to get
+ * the row index (use row zero if it's in none of them), identifying which of
+ * the interesting cclasses it's in to get the column index, and then indexing
+ * into the 2-D hicolormap array.
+ *
+ * The colormapranges are required to be nonempty, nonoverlapping, and to
+ * appear in increasing chr-value order.
  */
+
+#define NUM_CCLASSES 13            /* must match data in regc_locale.c */
+
+typedef struct colormaprange
+{
+   chr         cmin;           /* range represents cmin..cmax inclusive */
+   chr         cmax;
+   int         rownum;         /* row index in hicolormap array (>= 1) */
+} colormaprange;
+
 struct colormap
 {
    int         magic;
@@ -233,27 +213,27 @@ struct colormap
    color       free;           /* beginning of free chain (if non-0) */
    struct colordesc *cd;       /* pointer to array of colordescs */
 #define  CDEND(cm)  (&(cm)->cd[(cm)->max + 1])
+
+   /* mapping data for chrs <= MAX_SIMPLE_CHR: */
+   color      *locolormap;     /* simple array indexed by chr code */
+
+   /* mapping data for chrs > MAX_SIMPLE_CHR: */
+   int         classbits[NUM_CCLASSES];        /* see comment above */
+   int         numcmranges;    /* number of colormapranges */
+   colormaprange *cmranges;    /* ranges of high chrs */
+   color      *hicolormap;     /* 2-D array of color entries */
+   int         maxarrayrows;   /* number of array rows allocated */
+   int         hiarrayrows;    /* number of array rows in use */
+   int         hiarraycols;    /* number of array columns (2^N) */
+
    /* If we need up to NINLINECDS, we store them here to save a malloc */
-#define  NINLINECDS  ((size_t)10)
+#define  NINLINECDS  ((size_t) 10)
    struct colordesc cdspace[NINLINECDS];
-   union tree  tree[NBYTS];    /* tree top, plus lower-level fill blocks */
 };
 
-/* optimization magic to do fast chr->color mapping */
-#define B0(c)  ((c) & BYTMASK)
-#define B1(c)  (((c)>>BYTBITS) & BYTMASK)
-#define B2(c)  (((c)>>(2*BYTBITS)) & BYTMASK)
-#define B3(c)  (((c)>>(3*BYTBITS)) & BYTMASK)
-#if NBYTS == 1
-#define GETCOLOR(cm, c) ((cm)->tree->tcolor[B0(c)])
-#endif
-/* beware, for NBYTS>1, GETCOLOR() is unsafe -- 2nd arg used repeatedly */
-#if NBYTS == 2
-#define GETCOLOR(cm, c) ((cm)->tree->tptr[B1(c)]->tcolor[B0(c)])
-#endif
-#if NBYTS == 4
-#define GETCOLOR(cm, c) ((cm)->tree->tptr[B3(c)]->tptr[B2(c)]->tptr[B1(c)]->tcolor[B0(c)])
-#endif
+/* fetch color for chr; beware of multiple evaluation of c argument */
+#define GETCOLOR(cm, c) \
+   ((c) <= MAX_SIMPLE_CHR ? (cm)->locolormap[(c) - CHR_MIN] : pg_reg_getcolor(cm, c))
 
 
 /*
@@ -264,6 +244,11 @@ struct colormap
  * Representation of a set of characters.  chrs[] represents individual
  * code points, ranges[] represents ranges in the form min..max inclusive.
  *
+ * If the cvec represents a locale-specific character class, eg [[:alpha:]],
+ * then the chrs[] and ranges[] arrays contain only members of that class
+ * up to MAX_SIMPLE_CHR (inclusive).  cclasscode is set to regc_locale.c's
+ * code for the class, rather than being -1 as it is in an ordinary cvec.
+ *
  * Note that in cvecs gotten from newcvec() and intended to be freed by
  * freecvec(), both arrays of chrs are after the end of the struct, not
  * separately malloc'd; so chrspace and rangespace are effectively immutable.
@@ -276,6 +261,7 @@ struct cvec
    int         nranges;        /* number of ranges (chr pairs) */
    int         rangespace;     /* number of ranges allocated in ranges[] */
    chr        *ranges;         /* pointer to vector of chr pairs */
+   int         cclasscode;     /* value of "enum classes", or -1 */
 };
 
 
@@ -489,3 +475,8 @@ struct guts
    int         nlacons;        /* size of lacons[]; note that only slots
                                 * numbered 1 .. nlacons-1 are used */
 };
+
+
+/* prototypes for functions that are exported from regcomp.c to regexec.c */
+extern void pg_set_regex_collation(Oid collation);
+extern color pg_reg_getcolor(struct colormap * cm, chr c);
-#define BYTMASK (BYTTAB-1)     /* bit mask for byt */
-#define NBYTS  ((CHRBITS+BYTBITS-1)/BYTBITS)
-/* the definition of GETCOLOR(), below, assumes NBYTS <= 4 */
-
-
-
  /*
   * As soon as possible, we map chrs into equivalence classes -- "colors" --
   * which are of much more manageable number.
@@ -153,42 +136,9 @@ typedef short color;           /* colors of characters */
  #define MAX_COLOR  32767       /* max color (must fit in 'color' datatype) */
  #define COLORLESS  (-1)        /* impossible color */
  #define WHITE      0           /* default color, parent of all others */
-
+/* Note: various places in the code know that WHITE is zero */
  
  
-/*
- * A colormap is a tree -- more precisely, a DAG -- indexed at each level
- * by a byt of the chr, to map the chr to a color efficiently.  Because
- * lower sections of the tree can be shared, it can exploit the usual
- * sparseness of such a mapping table.  The tree is always NBYTS levels
- * deep (in the past it was shallower during construction but was "filled"
- * to full depth at the end of that); areas that are unaltered as yet point
- * to "fill blocks" which are entirely WHITE in color.
- *
- * Leaf-level tree blocks are of type "struct colors", while upper-level
- * blocks are of type "struct ptrs".  Pointers into the tree are generally
- * declared as "union tree *" to be agnostic about what level they point to.
- */
-
-/* the tree itself */
-struct colors
-{
-   color       ccolor[BYTTAB];
-};
-struct ptrs
-{
-   union tree *pptr[BYTTAB];
-};
-union tree
-{
-   struct colors colors;
-   struct ptrs ptrs;
-};
-
-/* use these pseudo-field names when dereferencing a "union tree" pointer */
-#define tcolor colors.ccolor
-#define tptr   ptrs.pptr
-
  /*
   * Per-color data structure for the compile-time color machinery
   *
@@ -203,26 +153,56 @@ union tree
   */
  struct colordesc
  {
-   uchr        nchrs;          /* number of chars of this color */
+   int         nschrs;         /* number of simple chars of this color */
+   int         nuchrs;         /* number of upper map entries of this color */
     color       sub;            /* open subcolor, if any; or free-chain ptr */
  #define  NOSUB  COLORLESS      /* value of "sub" when no open subcolor */
     struct arc *arcs;           /* chain of all arcs of this color */
-   chr         firstchr;       /* char first assigned to this color */
+   chr         firstchr;       /* simple char first assigned to this color */
     int         flags;          /* bit values defined next */
  #define  FREECOL 01                /* currently free */
  #define  PSEUDO  02                /* pseudocolor, no real chars */
  #define  UNUSEDCOLOR(cd) ((cd)->flags & FREECOL)
-   union tree *block;          /* block of solid color, if any */
  };
  
  /*
   * The color map itself
   *
- * Much of the data in the colormap struct is only used at compile time.
- * However, the bulk of the space usage is in the "tree" structure, so it's
- * not clear that there's much point in converting the rest to a more compact
- * form when compilation is finished.
+ * This struct holds both data used only at compile time, and the chr to
+ * color mapping information, used at both compile and run time.  The latter
+ * is the bulk of the space, so it's not really worth separating out the
+ * compile-only portion.
+ *
+ * Ideally, the mapping data would just be an array of colors indexed by
+ * chr codes; but for large character sets that's impractical.  Fortunately,
+ * common characters have smaller codes, so we can use a simple array for chr
+ * codes up to MAX_SIMPLE_CHR, and do something more complex for codes above
+ * that, without much loss of performance.  The "something more complex" is a
+ * 2-D array of color entries, where row indexes correspond to individual chrs
+ * or chr ranges that have been mentioned in the regex (with row zero
+ * representing all other chrs), and column indexes correspond to different
+ * sets of locale-dependent character classes such as "isalpha".  The
+ * classbits[k] entry is zero if we do not care about the k'th character class
+ * in this regex, and otherwise it is the bit to be OR'd into the column index
+ * if the character in question is a member of that class.  We find the color
+ * of a high-valued chr by identifying which colormaprange it is in to get
+ * the row index (use row zero if it's in none of them), identifying which of
+ * the interesting cclasses it's in to get the column index, and then indexing
+ * into the 2-D hicolormap array.
+ *
+ * The colormapranges are required to be nonempty, nonoverlapping, and to
+ * appear in increasing chr-value order.
   */
+
+#define NUM_CCLASSES 13            /* must match data in regc_locale.c */
+
+typedef struct colormaprange
+{
+   chr         cmin;           /* range represents cmin..cmax inclusive */
+   chr         cmax;
+   int         rownum;         /* row index in hicolormap array (>= 1) */
+} colormaprange;
+
  struct colormap
  {
     int         magic;
@@ -233,27 +213,27 @@ struct colormap
     color       free;           /* beginning of free chain (if non-0) */
     struct colordesc *cd;       /* pointer to array of colordescs */
  #define  CDEND(cm)  (&(cm)->cd[(cm)->max + 1])
+
+   /* mapping data for chrs <= MAX_SIMPLE_CHR: */
+   color      *locolormap;     /* simple array indexed by chr code */
+
+   /* mapping data for chrs > MAX_SIMPLE_CHR: */
+   int         classbits[NUM_CCLASSES];        /* see comment above */
+   int         numcmranges;    /* number of colormapranges */
+   colormaprange *cmranges;    /* ranges of high chrs */
+   color      *hicolormap;     /* 2-D array of color entries */
+   int         maxarrayrows;   /* number of array rows allocated */
+   int         hiarrayrows;    /* number of array rows in use */
+   int         hiarraycols;    /* number of array columns (2^N) */
+
     /* If we need up to NINLINECDS, we store them here to save a malloc */
-#define  NINLINECDS  ((size_t)10)
+#define  NINLINECDS  ((size_t) 10)
     struct colordesc cdspace[NINLINECDS];
-   union tree  tree[NBYTS];    /* tree top, plus lower-level fill blocks */
  };
  
-/* optimization magic to do fast chr->color mapping */
-#define B0(c)  ((c) & BYTMASK)
-#define B1(c)  (((c)>>BYTBITS) & BYTMASK)
-#define B2(c)  (((c)>>(2*BYTBITS)) & BYTMASK)
-#define B3(c)  (((c)>>(3*BYTBITS)) & BYTMASK)
-#if NBYTS == 1
-#define GETCOLOR(cm, c) ((cm)->tree->tcolor[B0(c)])
-#endif
-/* beware, for NBYTS>1, GETCOLOR() is unsafe -- 2nd arg used repeatedly */
-#if NBYTS == 2
-#define GETCOLOR(cm, c) ((cm)->tree->tptr[B1(c)]->tcolor[B0(c)])
-#endif
-#if NBYTS == 4
-#define GETCOLOR(cm, c) ((cm)->tree->tptr[B3(c)]->tptr[B2(c)]->tptr[B1(c)]->tcolor[B0(c)])
-#endif
+/* fetch color for chr; beware of multiple evaluation of c argument */
+#define GETCOLOR(cm, c) \
+   ((c) <= MAX_SIMPLE_CHR ? (cm)->locolormap[(c) - CHR_MIN] : pg_reg_getcolor(cm, c))
  
  
  /*
@@ -264,6 +244,11 @@ struct colormap
   * Representation of a set of characters.  chrs[] represents individual
   * code points, ranges[] represents ranges in the form min..max inclusive.
   *
+ * If the cvec represents a locale-specific character class, eg [[:alpha:]],
+ * then the chrs[] and ranges[] arrays contain only members of that class
+ * up to MAX_SIMPLE_CHR (inclusive).  cclasscode is set to regc_locale.c's
+ * code for the class, rather than being -1 as it is in an ordinary cvec.
+ *
   * Note that in cvecs gotten from newcvec() and intended to be freed by
   * freecvec(), both arrays of chrs are after the end of the struct, not
   * separately malloc'd; so chrspace and rangespace are effectively immutable.
@@ -276,6 +261,7 @@ struct cvec
     int         nranges;        /* number of ranges (chr pairs) */
     int         rangespace;     /* number of ranges allocated in ranges[] */
     chr        *ranges;         /* pointer to vector of chr pairs */
+   int         cclasscode;     /* value of "enum classes", or -1 */
  };
  
  
@@ -489,3 +475,8 @@ struct guts
     int         nlacons;        /* size of lacons[]; note that only slots
                                  * numbered 1 .. nlacons-1 are used */
  };
+
+
+/* prototypes for functions that are exported from regcomp.c to regexec.c */
+extern void pg_set_regex_collation(Oid collation);
+extern color pg_reg_getcolor(struct colormap * cm, chr c);
diff --git a/src/test/regress/expected/regex.linux.utf8.out b/src/test/regress/expected/regex.linux.utf8.out

new file mode 100644 (file)

index 0000000..7c170a9
--- /dev/null
+++ b/src/test/regress/expected/regex.linux.utf8.out
@@ -0,0 +1,164 @@
+/*
+ * This test is for Linux/glibc systems and others that implement proper
+ * locale classification of Unicode characters with high code values.
+ * It must be run in a database with UTF8 encoding and a Unicode-aware locale.
+ */
+SET client_encoding TO UTF8;
+--
+-- Test the "high colormap" logic with single characters and ranges that
+-- exceed the MAX_SIMPLE_CHR cutoff, here assumed to be less than U+2000.
+--
+-- trivial cases:
+SELECT 'aⓐ' ~ U&'a\24D0' AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 'aⓐ' ~ U&'a\24D1' AS f;
+ f 
+---
+ f
+(1 row)
+
+SELECT 'aⓕ' ~ 'a[ⓐ-ⓩ]' AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 'aⒻ' ~ 'a[ⓐ-ⓩ]' AS f;
+ f 
+---
+ f
+(1 row)
+
+-- cases requiring splitting of ranges:
+SELECT 'aⓕⓕ' ~ 'aⓕ[ⓐ-ⓩ]' AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 'aⓕⓐ' ~ 'aⓕ[ⓐ-ⓩ]' AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 'aⓐⓕ' ~ 'aⓕ[ⓐ-ⓩ]' AS f;
+ f 
+---
+ f
+(1 row)
+
+SELECT 'aⓕⓕ' ~ 'a[ⓐ-ⓩ]ⓕ' AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 'aⓕⓐ' ~ 'a[ⓐ-ⓩ]ⓕ' AS f;
+ f 
+---
+ f
+(1 row)
+
+SELECT 'aⓐⓕ' ~ 'a[ⓐ-ⓩ]ⓕ' AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 'aⒶⓜ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 'aⓜⓜ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 'aⓜⓩ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 'aⓩⓩ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS f;
+ f 
+---
+ f
+(1 row)
+
+SELECT 'aⓜ⓪' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS f;
+ f 
+---
+ f
+(1 row)
+
+SELECT 'a0' ~ 'a[a-ⓩ]' AS f;
+ f 
+---
+ f
+(1 row)
+
+SELECT 'aq' ~ 'a[a-ⓩ]' AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 'aⓜ' ~ 'a[a-ⓩ]' AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 'a⓪' ~ 'a[a-ⓩ]' AS f;
+ f 
+---
+ f
+(1 row)
+
+-- Locale-dependent character classes
+SELECT 'aⒶⓜ⓪' ~ '[[:alpha:]][[:alpha:]][[:alpha:]][[:graph:]]' AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 'aⒶⓜ⓪' ~ '[[:alpha:]][[:alpha:]][[:alpha:]][[:alpha:]]' AS f;
+ f 
+---
+ f
+(1 row)
+
+-- Locale-dependent character classes with high ranges
+SELECT 'aⒶⓜ⓪' ~ '[a-z][[:alpha:]][ⓐ-ⓩ][[:graph:]]' AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 'aⓜⒶ⓪' ~ '[a-z][[:alpha:]][ⓐ-ⓩ][[:graph:]]' AS f;
+ f 
+---
+ f
+(1 row)
+
+SELECT 'aⓜⒶ⓪' ~ '[a-z][ⓐ-ⓩ][[:alpha:]][[:graph:]]' AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 'aⒶⓜ⓪' ~ '[a-z][ⓐ-ⓩ][[:alpha:]][[:graph:]]' AS f;
+ f 
+---
+ f
+(1 row)
+
diff --git a/src/test/regress/sql/regex.linux.utf8.sql b/src/test/regress/sql/regex.linux.utf8.sql

new file mode 100644 (file)

index 0000000..4577811
--- /dev/null
+++ b/src/test/regress/sql/regex.linux.utf8.sql
@@ -0,0 +1,46 @@
+/*
+ * This test is for Linux/glibc systems and others that implement proper
+ * locale classification of Unicode characters with high code values.
+ * It must be run in a database with UTF8 encoding and a Unicode-aware locale.
+ */
+
+SET client_encoding TO UTF8;
+
+--
+-- Test the "high colormap" logic with single characters and ranges that
+-- exceed the MAX_SIMPLE_CHR cutoff, here assumed to be less than U+2000.
+--
+
+-- trivial cases:
+SELECT 'aⓐ' ~ U&'a\24D0' AS t;
+SELECT 'aⓐ' ~ U&'a\24D1' AS f;
+SELECT 'aⓕ' ~ 'a[ⓐ-ⓩ]' AS t;
+SELECT 'aⒻ' ~ 'a[ⓐ-ⓩ]' AS f;
+-- cases requiring splitting of ranges:
+SELECT 'aⓕⓕ' ~ 'aⓕ[ⓐ-ⓩ]' AS t;
+SELECT 'aⓕⓐ' ~ 'aⓕ[ⓐ-ⓩ]' AS t;
+SELECT 'aⓐⓕ' ~ 'aⓕ[ⓐ-ⓩ]' AS f;
+SELECT 'aⓕⓕ' ~ 'a[ⓐ-ⓩ]ⓕ' AS t;
+SELECT 'aⓕⓐ' ~ 'a[ⓐ-ⓩ]ⓕ' AS f;
+SELECT 'aⓐⓕ' ~ 'a[ⓐ-ⓩ]ⓕ' AS t;
+SELECT 'aⒶⓜ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS t;
+SELECT 'aⓜⓜ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS t;
+SELECT 'aⓜⓩ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS t;
+SELECT 'aⓩⓩ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS f;
+SELECT 'aⓜ⓪' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS f;
+SELECT 'a0' ~ 'a[a-ⓩ]' AS f;
+SELECT 'aq' ~ 'a[a-ⓩ]' AS t;
+SELECT 'aⓜ' ~ 'a[a-ⓩ]' AS t;
+SELECT 'a⓪' ~ 'a[a-ⓩ]' AS f;
+
+-- Locale-dependent character classes
+
+SELECT 'aⒶⓜ⓪' ~ '[[:alpha:]][[:alpha:]][[:alpha:]][[:graph:]]' AS t;
+SELECT 'aⒶⓜ⓪' ~ '[[:alpha:]][[:alpha:]][[:alpha:]][[:alpha:]]' AS f;
+
+-- Locale-dependent character classes with high ranges
+
+SELECT 'aⒶⓜ⓪' ~ '[a-z][[:alpha:]][ⓐ-ⓩ][[:graph:]]' AS t;
+SELECT 'aⓜⒶ⓪' ~ '[a-z][[:alpha:]][ⓐ-ⓩ][[:graph:]]' AS f;
+SELECT 'aⓜⒶ⓪' ~ '[a-z][ⓐ-ⓩ][[:alpha:]][[:graph:]]' AS t;
+SELECT 'aⒶⓜ⓪' ~ '[a-z][ⓐ-ⓩ][[:alpha:]][[:graph:]]' AS f;
author	Tom Lane
	Mon, 5 Sep 2016 21:06:29 +0000 (17:06 -0400)
committer	Tom Lane
	Mon, 5 Sep 2016 21:06:29 +0000 (17:06 -0400)
src/backend/regex/README		patch \| blob \| blame \| history
src/backend/regex/regc_color.c		patch \| blob \| blame \| history
src/backend/regex/regc_cvec.c		patch \| blob \| blame \| history
src/backend/regex/regc_locale.c		patch \| blob \| blame \| history
src/backend/regex/regc_pg_locale.c		patch \| blob \| blame \| history
src/backend/regex/regcomp.c		patch \| blob \| blame \| history
src/backend/regex/regexport.c		patch \| blob \| blame \| history
src/backend/regex/regprefix.c		patch \| blob \| blame \| history
src/include/regex/regcustom.h		patch \| blob \| blame \| history
src/include/regex/regex.h		patch \| blob \| blame \| history
src/include/regex/regguts.h		patch \| blob \| blame \| history
src/test/regress/expected/regex.linux.utf8.out	[new file with mode: 0644]	patch \| blob
src/test/regress/sql/regex.linux.utf8.sql	[new file with mode: 0644]	patch \| blob