0

Question:

I have a function that searches ODBC escape sequences, and then recursively replaces them with their native equivalent.

For example, given an ODBC escape sequence like: {fn concat(column1, column2)}

I replace it with its native SQL equivalent like this

if (StringComparer.OrdinalIgnoreCase.Equals("concat", strFunctionName)) {
    strArguments = strArguments;

    string[] astrArguments = strArguments.Split(',');
    string strTerm = astrArguments[0] + " || " + astrArguments[1];

    return strTerm;
}

Which yields column1 || column2 (PostGreSql syntax)

Generally speaking, I have to resolve the string "arglist_comma_separated" in {fn whatever(arglist_comma_separated)}

into its components

Needless to say, that with my current version, it doesn't work if argument 1 in arglist is something like

'hello, world'

Even worse, if I concat family and first name for example, it could be {fn whatever('d''Amato', 'Alberto')}

How would I best split the arguments ?

How can I do this with regex or by parsing ?

Edit:
Hmm, thought about it again, and had a splendid idea (replace '' with an escape character).
With that, the problem complexity has degraded from medium-high to trivial for a non-regex solution:
Addendum: Actually, only almost trivial, i forgot to consider function nesting in the splitting - stupid.

    protected static string[] GetArguments(string strAllArguments)
    {
        string EscapeCharacter = System.Convert.ToChar(8).ToString();

        strAllArguments = strAllArguments.Replace("''", EscapeCharacter);

        bool bInString = false;
        int iLastSplitAt = 0;
        int iInFunction = 0;

        System.Collections.Generic.List<string> lsArguments = new System.Collections.Generic.List<string>();

        for (int i = 0; i < strAllArguments.Length; i++)
        {
            char strCurrentChar = strAllArguments[i];

            if (strCurrentChar == '\'')
                bInString = !bInString;

            if (bInString)
                continue;


            if (strCurrentChar == '(')
                 iInFunction++;

            if (strCurrentChar == ')')
                 iInFunction--;

            if (strCurrentChar == ',')
            {
                if(iInFunction == 0)
                {
                    string strExtract = strAllArguments.Substring(iLastSplitAt, i - iLastSplitAt);
                    strExtract = strExtract.Replace(EscapeCharacter, "''");
                    lsArguments.Add(strExtract);
                    iLastSplitAt = i;
                }
            }

        }


        string strExtractLast = strAllArguments.Substring(iLastSplitAt + 1);
        strExtractLast = strExtractLast.Replace(EscapeCharacter, "''");
        lsArguments.Add(strExtractLast);

        string[] astrResult = lsArguments.ToArray();
        lsArguments.Clear();
        lsArguments = null;

        return astrResult;
    }
3
  • 1
    A similar question, where the conclusion could be "Don't do it with a regex": stackoverflow.com/questions/4150697/… Commented Mar 26, 2012 at 9:01
  • @DavidBrabant while I agree regexes are not silver bullets, I couldn't think of a better way to handle this case. The arguments here conforms to a regular language (or the SQL lexer wouldn't be able to parse them), so it makes sense to match each of them with a regex. In the other question, the OP wanted to match/split all of them with a single regex, and I totally agree with you that that wouldn't be advisable. Commented Mar 26, 2012 at 9:49
  • @Quandary +1 your new solution looks fine to me, as long as you ensure the EscapeCharacter can't possibly occur in your original input. I'd only suggest moving it out of the question and posting it as a new answer. Commented Mar 28, 2012 at 8:02

1 Answer 1

1

(Edit: I cleaned up this answer a lot, since the many feedback iterations; below are my conclusions on the subject)

Parsing a complex language is a hard task, so I'm assuming you narrowed down your problem to treating a comma-separated list of token values (such as strings, numbers, simple identifiers, etc - not complex expressions). If I'm mistaken, you probably have a bigger problem in your hands than you imagine. In that case, I'd suggest this question as a starting point.

The simplest solution - splitting on , - doesn't work mainly because of strings, since a comma can appear inside a string. Parsing a string is a straightforward task, assuming you deal with escape characters correctly: it starts with a quote, has zero or more characters and ends with another quote.

In most languages, if a string is delimited by ', you can escape a quote in it using \'. SQL interprets a '' inside a string as an escaped quote. If you know only one of those forms will be present, you can disregard the other. In my answer below, I decided to include both.

Also, some languages accept both single quotes (') and double quotes (") to delimit a string. The same observations about escaped characters apply. My solution also deals with both forms.

Besides strings, it's also important to specify what are valid characters for an argument. For simplicity, I assumed it would be "anything that is not a comma". For the same reason, my proposed solution will accept any number of strings and non-strings, and will group them together, returning them as a single entity (reiterating that, if complex expressions are expected, a more general parsing technique should be employed instead of this simple solution).

One way of implementing this would be looping through the chars while applying the logic above, as you've done in your recent update. Another would be using a regex. The regex has as pros better performance (usually) and a cleaner code, less error-prone. The main con is the complexity of the regex itself, since a "dense" format can be harder to understand/maintain.

My proposed regex would be then (spaces/newlines added for readability):

(
    (?:  \'   (?: ['\\]\' | [^'] )*   \'  |
         \"   (?: ["\\]\" | [^"] )*   \"  |
         [^,'"]
    )+
)
(?: \, | $)

In short format:

((?:\'(?:['\\]\'|[^'])*\'|\"(?:["\\]\"|[^"])*\"|[^,'"])+)(?:\,|$)

Each string accepts as "characters" either escaped quotes (' or \ followed by ') or anything that is not a quote. The match (the big capturing group) must be followed by either a , or the end of the input.

A live example of the regex above can be seen here (the example uses Ruby, but should work equally in C#). As long as the whole input is matched (i.e. no unmatched substring exists), each match will capture the argument correctly. Warning: malformed inputs will produce incorrect outputs, so the regex above must not be used for validation.

To employ this solution in your C# code, you can use Regex.Matches:

MatchCollection matches = Regex.Matches(strArguments, "((?:\'(?:['\\]\'|[^'])*\'|\"(?:["\\]\"|[^"])*\"|[^,'"])+)(?:\,|$)");
string[] arguments = from m in matches select m.Captures[1].Value;

As noted above, you must also ensure matches cover the whole input. I leave that as an exercise for the reader... ;)

Notes:

  1. I'm assuming the results of Matches are non-overlapping; if I'm mistaken, the code above must be adapted to do each match starting at the index the previous one ends;
  2. I'm also assuming, as usual, that the capturing group #0 will be the whole match, and the #1 will be first capturing group.
Sign up to request clarification or add additional context in comments.

7 Comments

That doesn't work on strArguments = "'hello d''someapostroph, world', 'test', foo"; So it will fail on {fn concat('d''Amato', 'Alberto')}. Arguments can, but don't have to be enclosed in quotes.
Sorry, was editing my answer to fix some mistake, and didn't see the comment. The new regex should work, see this example
I correct myself: It doesn't work. This regex is much worse then my split function. It already fails at: "AND {fn LCASE(BE_User)} LIKE {fn LCASE({fn CONCAT(@strTerm, '%')} )} ". It takes away the at. Then it fails on many more places as well.
You must specify what is considered valid in the unescaped part for the regex to work. Replace \w+ for [\w@]+ and you fix the "@", for instance. Another option would be just matching "whatever's not a comma", be it valid input or not... See the update above.
mgibsonbr: Now failing on "{fn concat({fn concat('(', DG_Code)}, ') ')}, DG_Lang_DE" with strArguments = "'(' || DG_Code, ') '";
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.