I am working on building a lexical analyzer for a fictional XML-style language and I'm currently trying to turn the following lexical specification into Java code:
Name -> Initial Other*
Initial -> Letter | _ | :
Other -> Initial | Digit | - | .
String -> " (Char | ')* " | '(Char | ")* '
Data -> Char+
Char -> Ordinary | Special | Reference
Ordinary -> NOT (< | > | " | ' | &)
Special -> < | > | " | ' | &
Reference -> &#(Digit)+; | &#x(Digit|a...f|A...F)+;
Letter -> a...z | A...Z
Digit -> 0...9
I'm no expert, but I do know I have to use regular expressions for these. So my Tokenizer now looks like this:
public Tokenizer(String str) {
    this.tokenContents = new ArrayList<TokenContent>();
    this.str = str;
    // Name = Initial Other*
    String initial = "[a-zA-Z] | _ | :";
    String other = initial + " | [0-9] | - | \\.";
    String name = initial + "(" + other + ")*";
    tokenContents.add(new TokenContent(Pattern.compile(name), TokenType.NAME));
    // String = " (Char | ')* " | ' (Char | ")* '
    String ordinary = "(?!(< | > | \" | ' | &))";
    String special = "< | > | \" | ' | &";
    String reference = "&#[0-9]+; | &#x([0-9] | [a-fA-F])+;";
    String character = ordinary + " | " + special + " | " + reference;
    String string = "\"(" + character + " | " + "')* \" | ' (\"" + character + " | " + "\")* '";
    tokenContents.add(new TokenContent(Pattern.compile(string), TokenType.STRING));
    // Data = Char+
    String data = character + "+";
    tokenContents.add(new TokenContent(Pattern.compile(data), TokenType.DATA));
    // The symbol <
    tokenContents.add(new TokenContent(Pattern.compile("<"), TokenType.LEFT_TAG));
    // The symbol >
    tokenContents.add(new TokenContent(Pattern.compile(">"), TokenType.RIGHT_TAG));
    // The symbol </
    tokenContents.add(new TokenContent(Pattern.compile("</"), TokenType.LEFT_TAG_SLASH));
    // The symbol />
    tokenContents.add(new TokenContent(Pattern.compile("/>"), TokenType.RIGHT_TAG_SLASH));
    // The symbol =
    tokenContents.add(new TokenContent(Pattern.compile("="), TokenType.EQUALS));
}
As you can see, I have modularized my regular expressions to mirror the specification above. However, when I run the lexer on an example input file, I get parsing errors. I suspect my regular expressions are the cause, so I would like suggestions on how to translate the specification into code correctly and fix my Tokenizer.
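One thing that can be checked in isolation is how Java treats the spaces inside pattern strings like the ones above: unless the `Pattern.COMMENTS` flag is set, whitespace in a regex is matched literally. The following is only a minimal sketch of that check; the class name `WhitespaceCheck` and the helper `matchesWhole` are made up for illustration and are not part of the Tokenizer.

```java
import java.util.regex.Pattern;

public class WhitespaceCheck {
    // Hypothetical helper: does the whole input match the given regex?
    static boolean matchesWhole(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String spaced = "[a-zA-Z] | _ | :"; // same shape as the "initial" pattern above
        String tight = "[a-zA-Z_:]";        // one character class, no literal spaces

        // The first alternative of "spaced" is really "a letter followed by a space".
        System.out.println(matchesWhole(spaced, 0, "a"));                // false
        System.out.println(matchesWhole(spaced, 0, "a "));               // true
        // With COMMENTS, whitespace in the pattern is ignored.
        System.out.println(matchesWhole(spaced, Pattern.COMMENTS, "a")); // true
        System.out.println(matchesWhole(tight, 0, "a"));                 // true
    }
}
```

If the spaces were only meant to make the patterns easier to read, either removing them or compiling with `Pattern.COMMENTS` would be one way to rule this out as a cause.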
My tokens are Name, String, Data, <, >, </, />, and =. They are all specified in an enum class that isn't displayed here. An example input file is:
<recipe name="bread" prep_time="5 mins" cook_time="3 hours">
    <title>Basic bread</title>
    <ingredient amount="3" unit="cups">Flour</ingredient>
    <ingredient amount="0.25" unit="ounce">Yeast</ingredient>
    <ingredient amount="1.5" unit="cups" state="warm">Water</ingredient>
    <ingredient amount="1" unit="teaspoon">Salt</ingredient>
    <instructions>
        <step>Mix all ingredients together.</step>
        <step>Knead thoroughly.</step>
        <step>Cover with a cloth, and leave for one hour in warm room.</step>
        <step>Knead again.</step>
        <step>Place in a bread baking tin.</step>
        <step>Cover with a cloth, and leave for one hour in warm room.</step>
        <step>Bake in the oven at 350° F for 30 minutes.</step>
    </instructions>
</recipe>
I haven't worked with regular expressions much before, so this is new to me. I would really appreciate any input that could help.
Char -> Ordinary | Special | Reference; Ordinary -> NOT (< | > | " | ' | &)
That is a grammar. A lexical specification matches regular expressions to token types ONLY.
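To make that point concrete, here is a rough sketch of a lexical specification written as an ordered list of (token type, regular expression) pairs. Several things in it are assumptions rather than anything taken from the question: the class `LexSketch`, the `RULES` map, the `tokenize` method, and the stand-in `TokenType` enum (the real one is not shown). The String rule is simplified to "anything up to the matching closing quote", whitespace between tokens is skipped, and Data is left out because it overlaps with Name and would need some notion of being inside or outside a tag.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LexSketch {
    // Stand-in for the enum that the question does not show.
    enum TokenType { LEFT_TAG_SLASH, RIGHT_TAG_SLASH, LEFT_TAG, RIGHT_TAG, EQUALS, NAME, STRING }

    // Rules are tried in insertion order, so "</" is listed before "<".
    static final Map<TokenType, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put(TokenType.LEFT_TAG_SLASH, Pattern.compile("</"));
        RULES.put(TokenType.RIGHT_TAG_SLASH, Pattern.compile("/>"));
        RULES.put(TokenType.LEFT_TAG, Pattern.compile("<"));
        RULES.put(TokenType.RIGHT_TAG, Pattern.compile(">"));
        RULES.put(TokenType.EQUALS, Pattern.compile("="));
        // Name -> Initial Other*
        RULES.put(TokenType.NAME, Pattern.compile("[a-zA-Z_:][a-zA-Z_:0-9.-]*"));
        // Simplified String rule (assumption): no closing quote inside the string.
        RULES.put(TokenType.STRING, Pattern.compile("\"[^\"]*\"|'[^']*'"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++; // skip whitespace between tokens
                continue;
            }
            boolean matched = false;
            for (Map.Entry<TokenType, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt()) { // match must start exactly at pos
                    tokens.add(rule.getKey() + ":" + m.group());
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [LEFT_TAG:<, NAME:step, NAME:id, EQUALS:=, STRING:"1", RIGHT_TAG:>]
        System.out.println(tokenize("<step id=\"1\">"));
    }
}
```

The helper rules from the specification (Initial, Other, Letter, Digit) are folded into the regular expressions here instead of becoming tokens of their own, which is the distinction the comment above is drawing.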