I am building a simple parser and I have trouble getting my head around the general design. What would be the best practice?
The parser takes a simple text file and structures it into an HTML file that makes heavy use of nested lists and adds an index and an ID per list item.
The input (indentation added for clarity):
A. First section with random name
    Article 1
        Spam and eggs and some more
    Article 2
        1. The first member
        2. The second member
        3. The final member
B. Second section called whatever
    Article 3
        This one has no members but it does contain subs
            a. item 1
            b. item 2
    Article 4
        1. A member
        2. A member with subs
            a. sub 1 here
            b. sub 2 here
            c. final sub
C. Another section
    etc
I already have regexes that find the various list items along with their line numbers (right now I am using a lexer, but that might be overkill, right?).
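To make the matching step concrete, it could look roughly like the sketch below. The patterns and the names `PATTERNS` and `tokenize` are hypothetical stand-ins, not my actual regexes; the sketch only assumes that every list item starts its line with its marker.

```python
import re

# Hypothetical patterns, one per list level; the first pattern that matches wins.
PATTERNS = [
    ("SECTION", re.compile(r"^([A-Z])\.\s+(.*)$")),
    ("ARTICLE", re.compile(r"^Article\s+(\d+)\s*(.*)$")),
    ("MEMBER", re.compile(r"^(\d+)\.\s+(.*)$")),
    ("SUB", re.compile(r"^([a-z])\.\s+(.*)$")),
]


def tokenize(text):
    """Yield (TYPE, marker, line_number, rest) for every non-empty input line."""
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        for kind, pattern in PATTERNS:
            match = pattern.match(line)
            if match:
                yield (kind, match.group(1), line_number, match.group(2))
                break
        else:
            # No marker found: plain text that belongs to the item above it.
            yield ("TEXT", None, line_number, line)
```

A plain loop over a handful of compiled patterns like this may be enough for a format this small, which is why I suspect the lexer is overkill.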
As I said, I need to make nested HTML lists, with an ID per list item. How would you, in your experience, represent the structure of the document?
As a series of flat tuples or dictionaries, one per list level, where each item is an ( id , line-number ) pair:
list_section = ( ('A',1), ('B',8), ('C',19), ... )
list_article = ( ('1',2), ('2',4), ('3',9), ('4',13), ... )
list_member = ( ('2-1',5), ('2-2',6), ('2-3',7), ('4-1',14), ...)
etc
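With this flat layout, the parent of an item has to be looked up by line number. A minimal sketch of that lookup, assuming each list is sorted by line number; `parent_of` is a hypothetical helper name:

```python
from bisect import bisect_right

# Flat lists as in the first option: (id, line_number), sorted by line number.
list_section = (("A", 1), ("B", 8))
list_article = (("1", 2), ("2", 4), ("3", 9), ("4", 13))


def parent_of(line_number, parents):
    """Return the id of the last parent that starts at or before line_number."""
    starts = [start for _, start in parents]
    index = bisect_right(starts, line_number) - 1
    return parents[index][0] if index >= 0 else None
```

For example, `parent_of(5, list_article)` finds the article that owns the member on line 5. Because a sub can hang directly under an article (as in Article 3), a real lookup would have to check more than one parent list and pick the closest match.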
Or as nested tuples, where every token has ( TYPE , id , line-number ) followed by its children:
(('SECTION','A',1 ,
    ('ARTICLE','1',2),
    ('ARTICLE','2',4 ,
        ('MEMBER','2-1',5),
        ('MEMBER','2-2',6),
        ('MEMBER','2-3',7)
    )
 ),
 ...
)
Right now I am leaning towards the second option. The first one would be easier to build and iterate over, but its hierarchy can only be inferred by comparing surrounding line numbers.
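For the nested option, the structure could be built from a flat token stream with a stack of currently open items. This is only a sketch under assumptions: tokens arrive in document order as ( TYPE , id , line-number ), the nesting order is fixed, and `LEVELS`, `build_tree` and the `SUB` type are hypothetical names. It uses lists instead of tuples because children are appended while parsing.

```python
# Assumed nesting order, outermost first; SUB stands for the a./b./c. items.
LEVELS = {"SECTION": 0, "ARTICLE": 1, "MEMBER": 2, "SUB": 3}


def build_tree(tokens):
    """Nest flat (TYPE, id, line_number) tokens into [TYPE, id, line_number, children]."""
    root = ["ROOT", None, 0, []]
    stack = [(-1, root)]  # (level, node) pairs for the currently open items
    for kind, ident, line_number in tokens:
        level = LEVELS[kind]
        # Close every open item that sits at the same level or deeper.
        while stack[-1][0] >= level:
            stack.pop()
        node = [kind, ident, line_number, []]
        stack[-1][1][3].append(node)  # attach to the nearest open ancestor
        stack.append((level, node))
    return root[3]
```

Since an item is attached to the nearest open ancestor rather than to a fixed parent type, a sub that follows an article directly (no member in between) still ends up in the right place.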
Would you do it this way, or in a different way altogether? I am not asking you to write my parser or regexes; I am just looking for sound advice on best practices.
I added the required output in HTML. The Index:
<div id="index">
<ol class="indexlist sections">
<li><a href="#listref_A">First section with random name</a><br>
Article 1 - 2</li>
<li><a href="#listref_B">Second section called whatever</a><br>
Article 3 - 4</li>
<li><a href="#listref_C">Another section</a><br>
Article 5</li>
</ol>
</div>
And the content:
<div id="content">
<ol class="sections">
<li id="listref_D"><h2></h2>
<ol class="articles">
<li id="listref_8">Article 8
<ol class="members">
<li id="listref_8-1">Member 1.</li>
<li id="listref_8-2">Member 2</li>
<li id="listref_8-3">Member 3</li>
<li id="listref_8-4">Member 4.</li>
</ol>
</li>
</ol>
</li>
<li id="listref_E">Section E
<ol class="articles">
<li id="listref_9">Article 9
<ol class="members">
<li id="listref_9-1">Member 1 has subs:
<ol class="subs">
<li id="listref_9-1-a">sub a;</li>
<li id="listref_9-1-b">sub b;</li>
<li id="listref_9-1-c">sub c.</li>
</ol>
</li>
<li id="listref_9-2">Member 2, refers to <a href="#listref_8-2">article 8 sub 2</a>.</li>
</ol>