How to structure a parser

Question

I am building a simple parser and I have trouble getting my head around the general design. What would be the best practice?

The parser takes a simple text file and structures it into a HTML file, which would make heavy use of nested lists and adds an index and an ID per list item.

The input (indentation added for clarity).

A. First section with random name
  Article 1
  Spam and eggs and some more
  Article 2
    1. The first member
    2. The second member
    3. The final member
B. Second section called whatever
  Article 3
  This one has no members but it does contain subs
    a. item 1
    b. item 2
  Article 4
    1. A member
    2. A member with subs
      a. sub 1 here
      b. sub 2 here
      c. final sub
C. Another section
etc

I have the regexes to find the various list items, with line numbers (right now I am using a lexer, but that might be overkill, right?)

As I said, I need to make nested HTML lists, with an ID per list item. How would you, in your experience, represent the structure of the document?

As a series of tuples or dictionaries, with per item the ( id , line-number ):

list_section = ( ('A',1), ('B',8), ('C',18), ... )
list_article = ( ('1',2), ('2',4), ('3',9), ('4',13), ... )
list_member = ( ('2-1',5), ('2-2',6), ('2-3',7), ('4-1',14), ...)
etc

Or as nested tuples, where every token has ( TYPE , id , line-number ):

(('SECTION','A',1 , 
    ('ARTICLE','1',2),
    ('ARTICLE','2',4 ,
        ('MEMBER','2-1',5),
        ('MEMBER','2-2',6),
        ('MEMBER','2-3',7)
    )
 )

Right now I am leaning towards the second option. The first one will be easier to build and iterated, but the hierarchy can only be inferred from looking at surrounding line numbers.

Would you do it this way, or in a different way altogether? I am not asking you to write my parser or regexes, I am just looking for sound advise on best-practices.

I added the required output in HTML. The Index:

<div id="index">
    <ol class="indexlist sections">
        <li><a href="#listref_A">First section with random name</a><br>
            Article 1 - 2</li>
        <li><a href="#listref_B">Second section called whatever</a><br>
            Artikel 3 - 4</li>
        <li><a href="#listref_C">Another section</a><br>
            Article 5</li>
    </ol>

And the content:

<div id="content">
    <ol class="sections">
        <li id="listref_D"><h2></h2>
        <ol class="articles">
            <li id="listref_8">Article 8
                <ol class="members">
                    <li id="listref_8-1">Member 1.</li>
                    <li id="listref_8-2">Member 2</li>
                    <li id="listref_8-3">Member 3</li>
                    <li id="listref_8-4">Member 4.</li>
                </ol>
            </li>
        </ol>
    </li>
    <li id="listref_E">Section E
        <ol class="articles">
            <li id="listref_9">Article 9
                <ol class="members">
                    <li id="listref_9-1">Member 1 has subs:
                        <ol class="subs">
                            <li id="listref_9-1-a">sub a;</li>
                            <li id="listref_9-1-b">sub b;</li>
                            <li id="listref_9-1-c">sub c.</li>
                        </ol>
                    </li>
                    <li id="lijstref_9-2">Member 2, refers to <a href="#listref_8-2">article 8 sub 2</a>.</li>
                </ol>

you should google "python parser generator" and determine if this is an option for you. also, dont bother about the syntax highlighting, we can all ignore it. you could make it a quote instead of code, but dont waste your time. — mnagel
– mnagel, Commented Sep 12, 2013 at 14:19
I've never had occasion to use it, but I always hear that PyParsing is excellent for this sort of thing. — Daniel Roseman
– Daniel Roseman, Commented Sep 12, 2013 at 14:23
@DanielRoseman From the looks of it, PyParser is more like a 'lexer'. I think I have that part covered. I don't mind the exercise to write a case-specific parser: would be a good learning experience. — Roy Prins
– Roy Prins, Commented Sep 12, 2013 at 14:32
Another resource is the <a href="dabeaz.com/ply/">ply</a> library, as used in the Udacity.com Python-based course <a href="udacity.com/course/cs262">Programming Languages</a> — Mike Housky
– Mike Housky, Commented Sep 12, 2013 at 14:34
The overall document structure is an ordered tree, and nested sequences aren't a bad way to represent that. You might find yourself wanting to define classes to represent the tree and its nodes, though, so that you can provide direct access to individual nodes based on keys or paths, and automatically maintain any auxiliary data structures (like a dict for access by id= attribute) if those end up being needed. — Mike Housky
– Mike Housky, Commented Sep 12, 2013 at 14:41

thejava · Accepted Answer · 2013-12-30 22:56:22Z

Try an ANTLR Lexer/Parser combo. All you need is little more than regular expressions to generate a lexer/basic parser combo. It uses a strategy similar to BNF grammars and you can define actions to print out to console or file very easily. It outputs Java by default, but ANTLR 4 also outputs to C#. ANTLR 3 can output to several more languages such as Ruby.

To generate part of the lexer you might do something like

 // Define Tokens
 WS : [ \t\r\n] ~> skip;
 DOT : '.';
 ARTICLE : 'Article';

 fragment DIGIT : [0-9];
 fragment ALPHA : [a-zA-Z];

 AlphaString : ALPHA+;
 Number : DIGIT+;
 AlphaNumericString : (AlphaString | Number)+;

 // Define Lexer and Parser Grammars
 SectionTitle : AlphaString;
 SectionHeader : SectionTitle DOT AlphaNumericString;

 ArticleHeader : ARTICLE Number;

 MemberTitle : Number;
 MemberHeader : MemberTitle DOT AlphaNumericString;

 submember : /*Code to define submember*/;     

 member : MemberHeader submember+;

 article : ArticleHeader (member | AlphaNumericString)+;

 section : SectionHeader
           (article | AlphaNumericString)+;

Obviously this isn't a comprehensive grammar, but it displays the basics. A good reference is the ANTLR 4 Documentation Wiki and ANTLR 4: The Defenitive Guide. These show you how to do these grammars and how to embed actions in them. They're both good guides for small or big projects. Chapter 2 and 3 of the latter showcase the basics you'll need in a simple manner with good examples.

Thanks for this, is it ok if I come back to this a bit later?

Collectives™ on Stack Overflow

How to structure a parser

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related