I am working on building a lexical analyzer for a fictional XML-style language and I'm currently trying to turn the following lexical specification into Java code:

Name -> Initial Other*
Initial -> Letter | _ | :
Other -> Initial | Digit | - | .
String -> " (Char | ')* " | '(Char | ")* '
Data -> Char+
Char -> Ordinary | Special | Reference
Ordinary -> NOT (< | > | " | ' | &) 
Special -> &lt; | &gt; | &quot; | &apos; | &amp;
Reference -> &#(Digit)+; | &#x(Digit|a...f|A...F)+;
Letter -> a...z | A...Z
Digit -> 0...9

I'm no expert, but I do know I have to use regular expressions for these. So my Tokenizer now looks like this:

public Tokenizer(String str) {
    this.tokenContents = new ArrayList<TokenContent>();
    this.str = str;

    // Name = Initial Other*
    String initial = "[a-zA-Z] | _ | :";
    String other = initial + " | [0-9] | - | \\.";
    String name = initial + "(" + other + ")*";
    tokenContents.add(new TokenContent(Pattern.compile(name), TokenType.NAME));
    // String = " " (Char | ')* " | ' (Char | ")* '
    String ordinary = "(?!(< | > | \" | ' | &))";
    String special = "&lt; | &gt; | &quot; | &apos; | &amp;";
    String reference = "&#[0-9]+; | &#x([0-9] | [a-fA-F])+;";
    String character = ordinary + " | " + special + " | " + reference;
    String string = "\"(" + character + " | " + "')* \" | ' (\"" + character + " | " + "\")* '";
    tokenContents.add(new TokenContent(Pattern.compile(string), TokenType.STRING));
    // Data = Char+
    String data = character + "+"; 
    tokenContents.add(new TokenContent(Pattern.compile(data), TokenType.DATA)); 
    // The symbol <
    tokenContents.add(new TokenContent(Pattern.compile("<"), TokenType.LEFT_TAG));
    // The symbol >
    tokenContents.add(new TokenContent(Pattern.compile(">"), TokenType.RIGHT_TAG));
    // The symbol </
    tokenContents.add(new TokenContent(Pattern.compile("</"), TokenType.LEFT_TAG_SLASH));
    // The symbol />
    tokenContents.add(new TokenContent(Pattern.compile("/>"), TokenType.RIGHT_TAG_SLASH));  
    // The symbol = 
    tokenContents.add(new TokenContent(Pattern.compile("="), TokenType.EQUALS));    
}

For simplicity, I have modularized my regular expressions according to the specification above. However, after running the lexer on an example input file through several test cases, I get parsing errors. I suspect the problem is my regular expressions, so I would like some suggestions on how to correctly translate the above specification into code and fix my Tokenizer.

My tokens are Name, String, Data, <, >, </, />, and =. They are all specified in an enum class that isn't displayed here. An example input file is:

<recipe name="bread" prep_time="5 mins" cook_time="3 hours">
   <title>Basic bread</title>
   <ingredient amount="3" unit="cups">Flour</ingredient>
   <ingredient amount="0.25" unit="ounce">Yeast</ingredient>
   <ingredient amount="1.5" unit="cups" state="warm">Water</ingredient>
   <ingredient amount="1" unit="teaspoon">Salt</ingredient>
   <instructions>
     <step>Mix all ingredients together.</step>
     <step>Knead thoroughly.</step>
     <step>Cover with a cloth, and leave for one hour in warm room.</step>
     <step>Knead again.</step>
     <step>Place in a bread baking tin.</step>
     <step>Cover with a cloth, and leave for one hour in warm room.</step>
     <step>Bake in the oven at 350&#x00B0; F for 30 minutes.</step>
   </instructions>
</recipe>

I've never worked much with regular expressions before, so this is a first for me. I would really appreciate any input that could help.

  • Please post some of the test cases that you're trying to parse. Commented Sep 11, 2016 at 4:17
  • Added an example input file. Thanks! Commented Sep 11, 2016 at 4:19
  • You can't use simple XML parsing ? Commented Sep 11, 2016 at 4:23
  • You are mixing Lexical Analysis with Parsing. Lexical Analysis would not have productions depending on other productions, such as Char -> Ordinary | Special | Reference; Ordinary -> NOT (< | > | " | ' | &). That is a grammar. A lexical specification matches regular expressions to token types ONLY. Commented Sep 11, 2016 at 4:42
  • Using regular expressions for lexers is common; using Java Patterns is not. Java Patterns are based on non-deterministic finite automata (NFAs), and this makes lexing very expensive. You may be interested in lexing with a DFA-based regular expression engine, e.g. rexlex. Commented Sep 11, 2016 at 4:58

1 Answer

String ordinary = "(?!(< | > | \" | ' | &))";

This pattern won't do what you want. Lookahead makes a pattern match only if it is followed (or, in the case of the negative lookahead you use here, not followed) by a specific pattern. The lookahead itself does not consume any input.

Take for example the pattern [a-z]+(?=\s). This matches a sequence of letters that is followed by a whitespace character, but not the whitespace itself. So the pattern would match the "abc" in "abc def" and would match nothing in "abc_def"; either way, the match would not include the space. If you use this in a tokenizer (that also has a rule for whitespace), "abc def " will be tokenized as "abc", " ", "def", " ", rather than "abc ", "def ". So that's useful.
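The non-consuming behavior is easy to verify directly with java.util.regex (the class name below is just for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Small sketch verifying that lookahead asserts but does not consume input.
public class LookaheadDemo {
    public static void main(String[] args) {
        Matcher m = Pattern.compile("[a-z]+(?=\\s)").matcher("abc def");
        if (m.find()) {
            System.out.println(m.group()); // "abc" -- letters only
            System.out.println(m.end());   // 3 -- the space at index 3 was not consumed
        }
        // No letter in "abc_def" is followed by whitespace, so there is no match at all.
        System.out.println(Pattern.compile("[a-z]+(?=\\s)").matcher("abc_def").find()); // false
    }
}
```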

But in your case your entire pattern is lookahead. So if you tokenized something using your rule, the result would look more like "", "", ... ad infinitum. That's less useful.
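You can see this zero-width behavior directly: a pattern that is nothing but a lookahead matches the empty string, so a tokenizer driven by it never advances. A minimal check (class name is illustrative):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: a pure-lookahead pattern matches, but the match is empty.
public class EmptyMatchDemo {
    public static void main(String[] args) {
        Matcher m = Pattern.compile("(?!(<|>|\"|'|&))").matcher("abc");
        System.out.println(m.find());            // true -- 'a' is not an excluded character
        System.out.println(m.group().isEmpty()); // true -- but zero characters were matched
        System.out.println(m.end());             // 0 -- a tokenizer would make no progress
    }
}
```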

What you want is a negative character class, which is created using [^...] where the ... is a list of characters or character ranges as you'd use with a normal character class. It matches exactly one character as long as that character is not in the specified list. Using this, your regex would look like this:

String ordinary = "[^<>\"'&]";
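As a quick sanity check (again, the class name is just for illustration), the negated class consumes exactly one character and rejects the five excluded ones:

```java
import java.util.regex.Pattern;

// Sanity check for the corrected 'ordinary' pattern.
public class OrdinaryDemo {
    public static void main(String[] args) {
        Pattern ordinary = Pattern.compile("[^<>\"'&]");
        System.out.println(ordinary.matcher("a").matches());  // true  -- one ordinary character
        System.out.println(ordinary.matcher("<").matches());  // false -- excluded character
        System.out.println(ordinary.matcher("ab").matches()); // false -- it consumes exactly one character
        // One or more ordinary characters, e.g. as part of a DATA token:
        System.out.println(Pattern.compile("[^<>\"'&]+").matcher("Basic bread").matches()); // true
    }
}
```

As an aside, note that there are no spaces inside the character class: in a Java regex a literal space is significant, so alternations written with spaces around | (as in your other patterns, e.g. "[a-zA-Z] | _ | :") will also match space characters.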