Parsing text to object with regex

Question

I'm using an API which returns text in the following format:

#start
#p 12345 foo
#p 12346 bar
#end
#start
#p 12345 foo2
#p 12346 bar2
#end

My parsing function:

function parseApiResponse(data) {

    var results = [], match, obj;

    // CST.REGEX.POST = /(#start)|(#end)|#p\s+(\S+)\s+(\S+)/ig
    // exec() is called on the pattern and takes the string to search.
    while (match = CST.REGEX.POST.exec(data)) {

        if (match[1]) {           // #start
            obj = {};

        } else if (match[2]) {    // #end
            results.push(obj);
            obj = null;           // prevent accidental reuse 
                                  // if input is malformed

        } else {                  // #p something something
            obj[match[3]] = match[4];
        }
    }

    return results;
}

This will give me a list of objects which looks something like this:

[{ '12345': 'foo', '12346': 'bar'}, /* etc... */]

However, if a line is formatted like this

#start
#p 12345
#p 12346 bar
#end

The line would actually be #p 12345\n and my match[4] would contain the next row's #p.
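
To make the failure concrete, here is a small sketch that runs the pattern over the malformed sample and records what each match consumed. The variable names (original, malformed, seen) are only illustrative and not part of my real code.

```javascript
// The pattern from parseApiResponse, unchanged.
var original = /(#start)|(#end)|#p\s+(\S+)\s+(\S+)/ig;

// Malformed input: the first #p line has a key but no value.
var malformed = '#start\n#p 12345\n#p 12346 bar\n#end';

var seen = [], match;
while ((match = original.exec(malformed)) !== null) {
    seen.push(match[0]); // full text consumed by this match
}

// The second entry spans two lines: \s+ matched the newline
// and the last (\S+) captured the "#p" of the following row.
console.log(seen);
```

Because the second match swallows the next row's #p, the 12346 entry is never parsed at all.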

How do I adjust the pattern to adapt to this?

On your fourth match, you're not allowing white space with \S+. Maybe that will give you a hint.
– kei, Commented Apr 9, 2014 at 15:22
@tgies I just added it since I saw it on regexr.com while testing
– Johan, Commented Apr 9, 2014 at 15:56
@Johan never mind, that was a stupid question, I confused myself
– tgies, Commented Apr 9, 2014 at 16:12

tgies · Accepted Answer · 2014-04-09 22:16:56Z

Assuming you have one #start, #end, or #p element per line, you can make your regex aware of this and add an additional non-capturing group to indicate that the last \s+(\S+) in a line is optional:

/(#start)|(#end)|#p\s+(\S+)(?:\s+(\S+))?$/igm

(?: ) is saying "treat this as a group, but don't capture the pattern it matches" (so it won't create an element in match). The ? that follows that group means "this group is optional and may or may not match anything in the pattern". The $ right after that, in conjunction with the m flag, matches the end of the line.
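
As a quick check, this sketch runs that pattern over the malformed sample from the question. The helper name collectPairs and the sample string are made up for illustration; only the regex comes from the answer.

```javascript
// Hypothetical helper: returns a [key, value] pair for every #p line.
function collectPairs(text) {
    var pattern = /(#start)|(#end)|#p\s+(\S+)(?:\s+(\S+))?$/igm;
    var pairs = [], match;
    while ((match = pattern.exec(text)) !== null) {
        if (match[3] !== undefined) {         // only #p lines set group 3
            pairs.push([match[3], match[4]]); // group 4 is undefined if absent
        }
    }
    return pairs;
}

var sample = '#start\n#p 12345\n#p 12346 bar\n#end';

// First pair has an undefined value; second pair is 12346 / bar.
console.log(collectPairs(sample));
```

The $ is what stops the optional group from reaching into the next line: after \s+ consumes the newline and (\S+) grabs "#p", the match is not at the end of a line, so the engine backtracks and skips the group.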

You can also avoid the (?: ) trickery by using * instead of + quantifiers, meaning "match zero or more times": change the trailing \s+(\S+) to \s*(\S*), keeping the $ and the m flag. This has the side effect that the space between the number and the data that follows it is now optional.
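
One practical difference between the two variants, sketched below with the #start/#end alternatives left out for brevity: with the optional group a missing value is undefined, while with the * quantifiers the last group always participates and yields an empty string. The variable names are illustrative.

```javascript
// Both patterns keep the trailing $ and the m flag.
var optionalGroup = /#p\s+(\S+)(?:\s+(\S+))?$/im;
var starQuantifiers = /#p\s+(\S+)\s*(\S*)$/im;

// First line has a key but no value.
var text = '#p 12345\n#p 12346 bar';

var a = optionalGroup.exec(text);
var b = starQuantifiers.exec(text);

console.log(a[1], a[2]); // value group did not participate: undefined
console.log(b[1], b[2]); // value group matched zero characters: ''
```

Which one is preferable depends on whether the calling code would rather test for undefined or for an empty string.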

I would rewrite the regex and refactor the code a bit as follows:

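// CST.REGEX.POST = /^#(start|end|p)(?:\s+(\d+)(?:[^\S\r\n]+([^\r\n]+))?)?$/igm
while (match = CST.REGEX.POST.exec(data)) {
  // The i flag also lets "#START" match, so normalize before comparing.
  switch (match[1].toLowerCase()) {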
    case 'start':
      obj = {};
      break;
    case 'end':
      results.push(obj);
      obj = null;
      break;
    case 'p':
      obj[match[2]] = match[3];
      break;
  }
}

I like capturing start, end, or p in the one capture group so I can use it in a switch statement. The version of the regex I use here is a little more discriminating (expects the token that follows #p to be numeric) and a little more forgiving (allows the last token on a #p line to contain any non-linebreak whitespace, e.g. #p 1138 this is only a test).
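
Put together, a self-contained version might look like the sketch below. The local pattern variable, the toLowerCase() call, and the guards against a missing #start or a #p without a number are my additions for illustration and are not in the snippet above.

```javascript
function parseApiResponse(data) {
    // Same pattern as above, held in a local variable for this sketch.
    var pattern = /^#(start|end|p)(?:\s+(\d+)(?:[^\S\r\n]+([^\r\n]+))?)?$/igm;
    var results = [], obj = null, match;

    while ((match = pattern.exec(data)) !== null) {
        switch (match[1].toLowerCase()) {   // i flag allows "#START" etc.
            case 'start':
                obj = {};
                break;
            case 'end':
                if (obj) results.push(obj); // assumption: ignore stray #end
                obj = null;
                break;
            case 'p':
                // assumption: skip #p lines outside a block or without a key
                if (obj && match[2] !== undefined) obj[match[2]] = match[3];
                break;
        }
    }
    return results;
}

var sample = [
    '#start',
    '#p 12345',
    '#p 1138 this is only a test',
    '#end'
].join('\n');

// One object: key 1138 holds the full phrase, key 12345 is present but undefined.
console.log(parseApiResponse(sample));
```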

edited Apr 9, 2014 at 22:16

answered Apr 9, 2014 at 16:12

tgies


2 Comments

Johan Over a year ago

Great, thanks! Final question: How do I allow the last group to contain spaces and digits e.g. #p 12345 foo can also be #p 12345 foo bar 1, where foo bar 1 would be the last group. Too tricky perhaps?

tgies Over a year ago

That's doable. You need to change the last optional noncapturing group to (?:[^\S\r\n]+([^\r\n]+))?. That means "anything that's not non-whitespace and not a linebreak (= any whitespace other than linebreaks) one or more times, followed by any character other than a linebreak one or more times." I'm editing the answer in a moment.
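
A minimal sketch of what the [^\S\r\n] class accepts, in case the double negative is hard to read. The variable name horizontalSpace is just for illustration.

```javascript
// [^\S\r\n]: a whitespace character that is neither \r nor \n.
var horizontalSpace = /^[^\S\r\n]$/;

console.log(horizontalSpace.test(' '));  // true  (space)
console.log(horizontalSpace.test('\t')); // true  (tab)
console.log(horizontalSpace.test('\n')); // false (line feed is excluded)
console.log(horizontalSpace.test('\r')); // false (carriage return is excluded)
console.log(horizontalSpace.test('a'));  // false (not whitespace at all)
```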
