From sumrn@dssv01.crd.ge.com Mon Apr  1 12:24:06 1996
Date: Mon, 18 Mar 96 08:20:45 -0500
From: sumrn@dssv01.crd.ge.com
To: msql-list@bunyip.com
Cc: sumrn@dssv01.crd.ge.com
Subject: Re: [mSQL] possible bug in msql 1.013

All,

  Because a number of people are running into problems with the LIKE
operator, I am including below a set of diffs to src/msql/msqldb.c from
the 1.0.13 distribution.  They are an experimental patch that I have been
using.  I have not seen it fail, but I have not had the time to test it as
extensively as I would like.  I did send it to a couple of other folks on
the list, but I haven't heard back from them.  No news is good news?

  In any case, here it is for you all to play with.  It attempts to
fix three problems that I noticed with the LIKE operator:

  1. Fixed the LIKE failure on strings that are exactly as long as the
     field's defined length.  (Correctness gain, but performance loss due
     to malloc and strncpy.)

  2. Fixed the NOT LIKE operator so that it no longer executes some of the
     regular comparison code.  (This should probably be a minuscule
     performance increase, except that it is completely eaten by 1.)

  3. Tamed the deletion of backslashes in the mSQL-to-regexp translator
     code to make treatment of special characters more uniform.  In other
     words, every regexp special character except backslash requires only
     two backslashes in front of it to be treated as a literal.  (One is
     stripped by mSQL; the second by the regexp package.)  The backslash
     character now requires only four backslashes because it is the
     escape character for both mSQL and the regexp package.  The single
     quote, of course, remains the same, needing only one backslash
     because it is special only to mSQL.
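
  For illustration, the translation rule in item 3 can be sketched as a
small standalone C function.  This is not the code from msqldb.c: the name
like_to_regex, its signature, and the caller-supplied buffer are assumptions
made for the example, and the escaping of ., *, and + follows the comment
in the patch rather than the full source.  The input is the pattern as the
translator sees it, that is, after mSQL has already stripped its own layer
of backslashes.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of mSQL): translate a LIKE pattern into
 * an anchored regexp.  Returns 0 on success, -1 if `out` is too small.
 */
int like_to_regex(const char *pat, char *out, size_t outsz)
{
    size_t n = 0;

    if (outsz < 3)                  /* room for '^', '$' and the NUL */
        return -1;
    out[n++] = '^';

    while (*pat) {
        /* One step writes at most 2 bytes; '$' and NUL still follow. */
        if (n + 4 > outsz)
            return -1;

        switch (*pat) {
        case '\\':
            /* Strip the backslash only in front of % or _ ... */
            if (pat[1] != '\0' && strchr("%_", pat[1]) != NULL)
                pat++;
            /* ... otherwise pass it through for the regexp package. */
            out[n++] = *pat++;
            break;
        case '_':                   /* any single character */
            out[n++] = '.';
            pat++;
            break;
        case '%':                   /* any run of characters */
            out[n++] = '.';
            out[n++] = '*';
            pat++;
            break;
        case '.':
        case '*':
        case '+':                   /* not offered as regexp operators */
            out[n++] = '\\';
            out[n++] = *pat++;
            break;
        default:
            out[n++] = *pat++;
            break;
        }
    }
    out[n++] = '$';
    out[n] = '\0';
    return 0;
}
```

  With this sketch, the pattern 50\% (as seen by the translator) becomes
^50%$, while \[x\] passes through unchanged as ^\[x\]$ so that the regexp
package treats the brackets as literals.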

  I noticed after the fact that some of the extra comments I left in
about what is going on are a little bit off, but don't worry about that
for now.  Try it out if you need it, and let me know if it works.
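
  To make the first fix concrete: the 1.0.13 code overwrote the last byte
of the field with a NUL before calling the regexp routines, so a value that
filled the whole field lost its final character during the match.  Below is
a minimal sketch of the copy-based approach, assuming a hypothetical helper
named field_to_cstring (the patch itself does the same thing inline with
malloc and strncpy).

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of mSQL): return a NUL-terminated heap
 * copy of a fixed-width field that may occupy all `width` bytes.
 * The caller must free() the result.  Returns NULL if malloc fails.
 */
char *field_to_cstring(const char *field, size_t width)
{
    char *copy = malloc(width + 1);     /* one extra byte for the NUL */

    if (copy == NULL)
        return NULL;
    strncpy(copy, field, width);        /* stops early at an embedded NUL */
    copy[width] = '\0';                 /* terminate even a full field */
    return copy;
}
```

  The extra allocation per comparison is where the performance cost comes
from; the benefit is that the original field data is never modified.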

  -- Bob


  [ Part 2: "msqldb.diff" ]

*** msqldb.c    1996/03/07 16:59:36     1.1
--- msqldb.c    1996/03/07 22:11:04
***************
*** 3096,3136 ****
                *re;
        int     maxLen;
  {
!       char    regbuf[1024],
!               hold;
        REG     char *cp1, *cp2;
        regexp  *reg;
!       int     res;

        /*
        ** Map an SQL regexp into a UNIX regexp
        */
        cp1 = re;
        cp2 = regbuf;
        (void)bzero(regbuf,sizeof(regbuf));
        *cp2++ = '^';
        while(*cp1 && maxLen)
        {
                switch(*cp1)
                {
                        case '\\':
!                               if (*(cp1+1))
!                               {
!                                       cp1++;
!                                       *cp2 = *cp1;
                                }
!                               cp1++;
!                               cp2++;
                                break;

                        case '_':
                                *cp2++ = '.';
                                cp1++;
                                break;

                        case '%':
                                *cp2++ = '.';
                                *cp2++ = '*';
                                cp1++;
                                break;

--- 3096,3179 ----
                *re;
        int     maxLen;
  {
!       char    regbuf[1024];
!       char    *hold;
        REG     char *cp1, *cp2;
        regexp  *reg;
!       int     res, rbi;

        /*
        ** Map an SQL regexp into a UNIX regexp
+       **
+       ** RNS **
+       ** Note:  ANSI SQL standard regexps are really globs where % is *
+       ** and _ is ?.
+       **
+       ** The current mSQL does glob-like regexps as follows:
+       ** 1. it always matches the whole string (the ^ and $ added below),
+       ** 2. it does not allow the ., *, and + regexps,
+       ** 3. it does allow character classes i.e., [], and
+       ** 4. it does allow alternation using | and ().
+       **
+       ** However, msql 1.0.13 and earlier are too eager to strip
+       ** backslashes from the incoming expression (re).  This version now
+       ** strips the backslashes only when the following character is
+       ** a % or _.
+       **
+       ** Also, str is NOT guaranteed to be NULL terminated while
+       ** re is guaranteed to be NULL terminated.
+       **
+       ** My personal opinion is that one should explicitly do one or both of
+       ** 1. accept only the ANSI stuff (which is really poor) or
+       ** 2. accept the full regexp stuff.
+       ** Note that in case 1 there are all kinds of
+       ** gotchas for the translator to the regexp mechanism.
+       ** Maybe there should be LIKE for case 1 and RLIKE for case 2?
+       ** (Tcl has both glob and regexp commands though the former works
+       ** on the file system and the latter works on strings passed to it.)
+       **
+       ** I also added a check for overflowing regbuf even if it
+       ** would be rare.  One could malloc a buffer 2*strlen(re) in order
+       ** to avoid the check.  I am not sure what's better.
+       ** RNS **
        */
        cp1 = re;
        cp2 = regbuf;
+       rbi = 0;
        (void)bzero(regbuf,sizeof(regbuf));
        *cp2++ = '^';
+       rbi++;
        while(*cp1 && maxLen)
        {
                switch(*cp1)
                {
                        case '\\':
!                               /*
!                               ** RNS
!                               ** To get old escape handling back, remove
!                               ** all of expression starting with && through
!                               ** the next-to-last closing parenthesis,
!                               ** inclusive.
!                               ** RNS
!                               */
!                               if (*(cp1+1) &&
!                                   ((*(cp1+1) == '%') || (*(cp1+1) == '_'))) {
!                                 cp1++;
                                }
!                               *cp2++ = *cp1++;
!                               rbi++;
                                break;

                        case '_':
                                *cp2++ = '.';
+                               rbi++;
                                cp1++;
                                break;

                        case '%':
                                *cp2++ = '.';
                                *cp2++ = '*';
+                               rbi += 2;
                                cp1++;
                                break;

***************
*** 3139,3167 ****
                        case '+':
                                *cp2++ = '\\';
                                *cp2++ = *cp1++;
                                break;

                        default:
                                *cp2++ = *cp1++;
                                break;
                }
        }
        *cp2 = '$';

        /*
!       ** Do the regexp thang.  We do an ugly hack here : The data of
!       ** a field may be exactly the same length as the field itself.
!       ** Seeing as the regexp routines work on null rerminated strings
!       ** if the field is totally full we get field over-run.  So,
!       ** store the value of the last byte, null it out, run the regexp
!       ** and then reset it (hey, I said it was ugly).
        */
        regErrFlag = 0;
!       hold = *(str + maxLen - 1);
!       *(str + maxLen - 1) = 0;
        reg = regcomp(regbuf);
!       res = regexec(reg,str);
!       *(str + maxLen - 1) = hold;
        safeFree(reg);
        if (regErrFlag)
        {
--- 3182,3220 ----
                        case '+':
                                *cp2++ = '\\';
                                *cp2++ = *cp1++;
+                               rbi += 2;
                                break;

                        default:
                                *cp2++ = *cp1++;
+                               rbi++;
                                break;
                }
+       if (rbi >= sizeof(regbuf) - 3) { /* room for 2 more + '$' + NUL */
+         strcpy(errMsg, BAD_LIKE_ERROR);
+         msqlDebug(MOD_ERR, "LIKE clause expression too long\n");
+         return(-1);
+       }
        }
        *cp2 = '$';

        /*
!       ** RNS **
!       ** With the malloc and strncpy, it costs a lot of performance,
!       ** but it compares things correctly.
!       ** RNS **
        */
        regErrFlag = 0;
!       if ((hold = (char *)malloc(maxLen + 1)) == NULL) {
!               strcpy(errMsg, BAD_LIKE_ERROR);
!               msqlDebug(MOD_ERR, "LIKE clause out of memory\n");
!               return(-1);
!       }
!       strncpy( hold, str, maxLen );
!       hold[maxLen] = '\0';
        reg = regcomp(regbuf);
!       res = regexec(reg,hold);
!       (void)free(hold);
        safeFree(reg);
        if (regErrFlag)
        {
***************
*** 3242,3249 ****
        REG     char    *c1,*c2;
        REG     int     offset;

!       if (op != LIKE_OP)
!       {
                c1 = v1;
                c2 = v2;
                offset=0;
--- 3295,3303 ----
        REG     char    *c1,*c2;
        REG     int     offset;

!       if ((op == LIKE_OP) || (op == NOT_LIKE_OP)) {
!         cmp = regexpTest( v1, v2, maxLen );
!       } else {
                c1 = v1;
                c2 = v2;
                offset=0;
***************
*** 3286,3296 ****
                        break;

                case LIKE_OP:
!                       result = regexpTest(v1,v2,maxLen);
                        break;

                case NOT_LIKE_OP:
!                       result = !(regexpTest(v1,v2,maxLen));
                        break;
        }
        return(result);
--- 3340,3350 ----
                        break;

                case LIKE_OP:
!                       result = cmp;
                        break;

                case NOT_LIKE_OP:
!                       result = !cmp;
                        break;
        }
        return(result);

  [ Part 3: "Attached Text" ]

Robert Sum                   Phone:              +1 (518) 387-7696
G.E. Corp. R. & D.           E-mail:             sumrn@crd.ge.com
P.O. Box 8, Rm. KW-C279      Eng. & Manuf. WWW:  http://ce-toolkit.crd.ge.com
Schenectady, NY 12301 USA    GE's general WWW:   http://www.ge.com

Please note that the "Standard Disclaimer" applies here.
