
I'm using an API which returns text in the following format:

#start
#p 12345 foo
#p 12346 bar
#end
#start
#p 12345 foo2
#p 12346 bar2
#end

My parsing function:

function parseApiResponse(data) {

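    // Held in CST.REGEX.POST in my real code; inlined here so the snippet runs on its own.
    var pattern = /(#start)|(#end)|#p\s+(\S+)\s+(\S+)/ig;
    var results = [], match, obj;

    while ((match = pattern.exec(data)) !== null) {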

        if (match[1]) {           // #start
            obj = {};

        } else if (match[2]) {    // #end
            results.push(obj);
            obj = null;           // prevent accidental reuse 
                                  // if input is malformed

        } else {                  // #p something something
            obj[match[3]] = match[4];
        }
    }

    return results;
}

This will give me a list of objects which looks something like this:

[{ '12345': 'foo', '12346': 'bar'}, /* etc... */]

However, if a line is formatted like this

#start
#p 12345
#p 12346 bar
#end

The line would actually be #p 12345\n and my match[4] would contain the next row's #p.
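
A minimal reproduction of the problem described above, using the pattern exactly as posted. The `sample` string and the `captured` array are made up for illustration. Because `\s+` also matches a newline, the second `(\S+)` reaches into the next line:

```javascript
// Pattern from the question, unchanged.
var pattern = /(#start)|(#end)|#p\s+(\S+)\s+(\S+)/ig;

// Hypothetical input: the first #p line has no value.
var sample = '#start\n#p 12345\n#p 12346 bar\n#end';

var captured = [];
var m;
while ((m = pattern.exec(sample)) !== null) {
    if (m[3] !== undefined) {          // only #p matches fill group 3
        captured.push([m[3], m[4]]);
    }
}

console.log(captured); // [ [ '12345', '#p' ] ]
```

Note that the `#p 12346 bar` line is lost entirely, since its `#p` was consumed as the value of the previous line.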

How do I adjust the pattern to adapt to this?

  • On your fourth match, you're not allowing white space with \S+. Maybe that will give you a hint. Commented Apr 9, 2014 at 15:22
  • @tgies I just added it since I saw it on regexr.com while testing Commented Apr 9, 2014 at 15:56
  • @Johan never mind, that was a stupid question, I confused myself Commented Apr 9, 2014 at 16:12

1 Answer

Assuming you have one #start, #end, or #p element per line, you can make your regex aware of this and add an additional non-capturing group to indicate that the last \s+(\S+) in a line is optional:

/(#start)|(#end)|#p\s+(\S+)(?:\s+(\S+))?$/igm

(?: ) is saying "treat this as a group, but don't capture the pattern it matches" (so it won't create an element in match). The ? that follows that group means "this group is optional and may or may not match anything in the pattern". The $ right after that, in conjunction with the m flag, matches the end of the line.
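
As a quick check of the pattern above, here is a small sketch run against the problematic input from the question. The `sample` string and the `pairs` array are only illustrative names:

```javascript
// Pattern with the optional non-capturing group and the end-of-line anchor.
var pattern = /(#start)|(#end)|#p\s+(\S+)(?:\s+(\S+))?$/igm;

// Hypothetical input: the first #p line has no value.
var sample = '#start\n#p 12345\n#p 12346 bar\n#end';

var pairs = [];
var m;
while ((m = pattern.exec(sample)) !== null) {
    if (m[3] !== undefined) {          // only #p matches fill group 3
        pairs.push([m[3], m[4]]);      // m[4] is undefined when the value is missing
    }
}

console.log(pairs); // [ [ '12345', undefined ], [ '12346', 'bar' ] ]
```

The engine first tries to take `\n#p` as the optional part, but `$` then fails because the line continues, so it backtracks and leaves the optional group unmatched.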

You can also avoid the (?: ) trickery by using * instead of + quantifiers, meaning "match zero or more times": change \s+(\S+) to \s*(\S*). This has the side effect that the space between the number and the data that follows it is now optional.
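
A sketch of that `*` variant, assuming the change is applied to the last `\s+(\S+)` only and that the `$` anchor and `m` flag are kept (without them the second group would still run into the next line). One visible difference is that a missing value comes back as an empty string instead of `undefined`:

```javascript
// Assumed form of the "*" variant: last \s+(\S+) relaxed to \s*(\S*).
var pattern = /(#start)|(#end)|#p\s+(\S+)\s*(\S*)$/igm;

// Hypothetical input: the first #p line has no value.
var sample = '#start\n#p 12345\n#p 12346 bar\n#end';

var pairs = [];
var m;
while ((m = pattern.exec(sample)) !== null) {
    if (m[3] !== undefined) {          // only #p matches fill group 3
        pairs.push([m[3], m[4]]);      // group 4 always participates, possibly empty
    }
}

console.log(pairs); // [ [ '12345', '' ], [ '12346', 'bar' ] ]
```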

I would rewrite the regex and refactor the code a bit as follows:

var pattern = /^#(start|end|p)(?:\s+(\d+)(?:[^\S\r\n]+([^\r\n]+))?)?$/igm;

while ((match = pattern.exec(data)) !== null) {
  switch (match[1].toLowerCase()) {  // the i flag also lets #START, #End, etc. through
    case 'start':
      obj = {};
      break;
    case 'end':
      results.push(obj);
      obj = null;
      break;
    case 'p':
      obj[match[2]] = match[3];
      break;
  }
}

I like capturing start, end, or p in the one capture group so I can use it in a switch statement. The version of the regex I use here is a little more discriminating (expects the token that follows #p to be numeric) and a little more forgiving (allows the last token on a #p line to contain any non-linebreak whitespace, e.g. #p 1138 this is only a test).
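
Putting the pieces together, a self-contained sketch of the whole function might look like the following. The null checks on `obj`, the check that a key was captured, and the `sample` input are additions for illustration, not part of the original code:

```javascript
function parseApiResponse(data) {
    var pattern = /^#(start|end|p)(?:\s+(\d+)(?:[^\S\r\n]+([^\r\n]+))?)?$/igm;
    var results = [], obj = null, match;

    while ((match = pattern.exec(data)) !== null) {
        switch (match[1].toLowerCase()) {
            case 'start':
                obj = {};
                break;
            case 'end':
                if (obj) {                     // added guard: ignore #end without #start
                    results.push(obj);
                }
                obj = null;
                break;
            case 'p':
                // Added guard: skip #p outside a block or without a numeric key.
                if (obj && match[2] !== undefined) {
                    obj[match[2]] = match[3];  // undefined when the value is missing
                }
                break;
        }
    }

    return results;
}

// Hypothetical input: one #p line without a value, one with spaces in the value.
var sample = ['#start', '#p 12345', '#p 1138 this is only a test', '#end'].join('\n');
var parsed = parseApiResponse(sample);
console.log(parsed);
```

With this input, `parsed` holds a single object in which `'1138'` maps to `'this is only a test'` and `'12345'` is present with the value `undefined`.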


2 Comments

Great, thanks! Final question: How do I allow the last group to contain spaces and digits e.g. #p 12345 foo can also be #p 12345 foo bar 1, where foo bar 1 would be the last group. Too tricky perhaps?
That's doable. You need to change the last optional noncapturing group to (?:[^\S\r\n]+([^\r\n]+))?. That means "one or more characters that are neither non-whitespace nor a linebreak (= any whitespace other than linebreaks), followed by one or more characters that are not a linebreak." I'm editing the answer in a moment.
