0

I am working on some code to parse text into XML. I am currently using Java and JAXB to handle the XML and the in-program representation of my data. I need to set up an easily expandable and adaptable method to parse the information from my text files into my Java classes. The data will for the most part stay the same, but I need to be able to support later changes in the text input format. (I am parsing airline pilot flight schedules, and I want to support the schedules of other airlines down the road.) Regular expressions seem like the way to go, but the little I have worked with Java regular expressions makes them seem a poor solution compared to Python's, specifically because of named captures. But I know less about Python than I do about Java!

So, I am looking for a modular system to parse text data that I can easily adapt, extend, and distribute later on. I am willing to learn more Python if I need it, but my time and abilities are limited. Any suggestions? An example of the text I am parsing follows.

=================================================================================================
 8122 TU             REPORT AT 06.45/N             EFFECTIVE JUN 08-JUN 29
      1 CAPT, 1 F/O
   DAY  FLT.  EQP DEPARTS   ARRIVES    BLK.   BLK.  DUTY   CR.     LAYOVER   MO TU WE TR FR SA SU
   TU   180   320 PHX 0745  SAN 0857* 1.12                                      -- -- -- -- -- --
   TU   005   320 SAN 0950  PHX 1106  1.16                                   --  8 -- -- -- -- --
   TU   592 L 320 PHX 1215  MCI 1652  2.37                                   -- 15 -- -- -- -- --
             Radisson A/P                     5.05  8.22  5.05  MCI  12.18   -- 22 -- -- -- -- --
             (816) 464-2423                                                  -- 29 --            
   WE   403 B 320 MCI 0610  PHX 0657  2.47                                  
   WE   149   320 PHX 0859  CMH 1547  3.48                                  
             Holiday Inn City Center          6.35  9.37  6.35  CMH  15.13  
             (614) 221-3281                                        
   TH   335 B 320 CMH 0800  PHX 0913  4.13                                  
   TH   343 L 320 PHX 1029  PVR 1508  2.39                                  
             Marriott Casamagna               6.52  9.23  6.52  PVR  15.52  
             52-322-2260000 TRANS: Hotel Shuttle                   
   FR   621   320 PVR 0815  PHX 0839  2.24                                  
                                              2.24  3.39  2.24              
      CREDIT HRS.  21.00     BLK. HRS. 20.56    LDGS:  8     TAFB    74.24  
=================================================================================================
4 Comments
  • Could you also provide the XML that you would want to be generated from your example text? Commented May 20, 2011 at 16:29
    "I am looking for a modular system to parse text data that I can easily adapt, extend, and distribute later on" -- this is called "Perl" ;) Commented May 20, 2011 at 16:43
  • There are many many tools for parsing things. The simplest way would be simple string manipulation (splits etc). Going up in complexity and flexibility are probably: regexes, a Python library like pyparsing, and a parser generator like ANTLR. You'll need to decide which of these to use based on how much work you want to do. Commented May 20, 2011 at 17:30
  • @Stephen - I admit to being a bit intimidated by Perl. So much of it looks like it was written at 3am by a cartoon programmer with Tourettes. :) @Andrew - JAXB makes the XML part fairly simple; I'll try to post a sample when I get back to my "work" machine. @katrielex - I guess I'm just spoiled for choice. It looks like Python and pyparsing might be the way to go. Now, if I could make it happen with no work at all.... :) Commented May 23, 2011 at 3:11

3 Answers

2

Those look like fixed-width fields, which are probably a good candidate for simple string slicing by column position. The only thing it looks like you would need regular expressions for is determining what type of record you are looking at, although the indentation level also looks useful for that.
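As a rough illustration of the fixed-width idea, here is a minimal Python sketch. The column offsets in `LEG_FIELDS`, the field names, and the helpers `parse_leg` and `is_leg` are all assumptions inferred from the sample schedule in the question, not a known specification of the format, so they would need checking against real files.

```python
# Hypothetical column layout inferred from the sample schedule in the question.
# Each entry maps a field name to (start, end) character offsets, end exclusive.
LEG_FIELDS = {
    "day": (3, 5),
    "flight": (8, 11),
    "meal": (12, 13),
    "equipment": (14, 17),
    "origin": (18, 21),
    "departs": (22, 26),
    "destination": (28, 31),
    "arrives": (32, 36),
    "block": (38, 42),
}


def is_leg(line):
    """Guess the record type from indentation: a flight leg is indented
    exactly three spaces and starts with a two-letter day code."""
    return (
        len(line) > 5
        and line[:3] == "   "
        and line[3:5].isalpha()
        and line[5] == " "
    )


def parse_leg(line):
    """Slice one flight-leg line into a dict of stripped field values."""
    return {
        name: line[start:end].strip()
        for name, (start, end) in LEG_FIELDS.items()
    }


sample = "   TU   592 L 320 PHX 1215  MCI 1652  2.37"
if is_leg(sample):
    print(parse_leg(sample))
```

Supporting another airline's layout would then mostly mean supplying a different offsets table, assuming its schedules are also fixed-width.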


0

You should be fine with Java regular expressions, and supporting named captures should be a trivial exercise. After all, it is just a matter of mapping capture group numbers to names. I even have code for this around somewhere, but can't share it for copyright reasons.

You could put regular expressions to parse the individual parts of such listings in a text file and make those part of your configuration. Regular expressions are compiled at run-time, so this should be fairly dynamic.
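To make the configuration idea concrete, here is a minimal sketch in Python, whose `re` module has named captures built in. The `[patterns]` section name, the two patterns, the group names (including `trip_id` for the leading number), and the helpers `load_patterns` and `classify` are hypothetical and only fitted to the sample in the question. In practice the text would live in one file per airline rather than in a string.

```python
import configparser
import re

# Hypothetical configuration text; the patterns are assumptions fitted to
# the sample schedule, not a complete description of the format.
CONFIG_TEXT = r"""
[patterns]
header = ^\s*(?P<trip_id>\d+)\s+(?P<day>[A-Z]{2})\s+REPORT AT (?P<report>\d{2}\.\d{2})
leg = ^\s{3}(?P<day>[A-Z]{2})\s+(?P<flight>\d+)\s+(?P<meal>[A-Z])?\s*(?P<equipment>\d{3})\s+(?P<origin>[A-Z]{3})\s(?P<departs>\d{4})\s+(?P<destination>[A-Z]{3})\s(?P<arrives>\d{4})
"""


def load_patterns(text):
    """Compile every regex in the [patterns] section at run time."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {name: re.compile(rx) for name, rx in parser["patterns"].items()}


def classify(line, patterns):
    """Return (record_type, named_fields) for the first matching pattern,
    or None if no pattern matches the line."""
    for name, rx in patterns.items():
        match = rx.match(line)
        if match:
            return name, match.groupdict()
    return None


patterns = load_patterns(CONFIG_TEXT)
print(classify("   TU   592 L 320 PHX 1215  MCI 1652  2.37", patterns))
```

Because the patterns are data rather than code, adapting to a changed input format would mean editing the configuration, not the parser.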

If you want a more flexible system (albeit at the cost of a pre-compilation step), have a look at parser generators like JavaCC or ANTLR. These allow you to create context-free grammars which are considerably more powerful than regexp.

1 Comment

Thanks for the suggestions. I was hoping there was some tool partway between rolling your own and ANTLR that I hadn't come across yet. My brain space is limited and ANTLR looks like overkill for my problem. I think I'll probably wind up going with Python and pyparsing. I was trying to stay with Java for the easy GUI and tools (NetBeans), but Python seems a better fit for the text manipulations I'll be doing.
0

In Python, you could try Gelatin.

