0

I would like to extract items from this sample HTML. More specifically, I would like to isolate the following: algp1, PRODUCTION, 50733, GEN_APPL, KANTOOR

<table width="95%" border="1">
<tr><td colspan=3><a name="algp1"></a><img src="menu/db2inst.jpg">  <font color="#FF0000" size="+1">algp1</font> (PRODUCTION, 50733)</td></tr>
<tr><td width="20%" valign=top><a name="GENAPPLP"></a><img src="menu/db2db.jpg"><font color="#00CC00"><b> GEN_APPL</font></b><br>(GENAPPLP)</td><td width="15%" valign=top>PARK</td><td width="70%" valign=top><font size="2">BOOKINGCARPARKING&sbquo; CUSTOMERS&sbquo; </font></td></tr>
<tr><td width="20%" valign=top></td><td width="15%" valign=top>RDC</td><td width="70%" valign=top><font size="2">DBREL_SCHEMA_RDCPROJECT&sbquo; DBVERSION&sbquo; </font></td></tr>
<tr><td width="20%" valign=top><a name="KANTOORP"></a><img src="menu/db2db.jpg"><font color="#00CC00"><b> KANTOOR</font></b><br>(KANTOORP)</td><td width="15%" valign=top>CDDB</td><td width="70%" valign=top><font size="2">BATIMENTS&sbquo; BATIMENTS_EXC&sbquo; OFFICES&sbquo; OFFICES_EXC&sbquo; RECETTES&sbquo; RECETTES_EXC&sbquo; </font></td></tr>
<tr><td width="20%" valign=top></td><td width="15%" valign=top>IDR</td><td width="70%" valign=top><font size="2">ADMINISTRATION&sbquo; ADMINISTRATION_EXC&sbquo; ARROND&sbquo; ARROND_EXC&sbquo; BUREAU&sbquo; BUREAU_EXC&sbquo; CODEX&sbquo; CODEX_EXC&sbquo; COMMUNE&sbquo; COMMUNE_EXC&sbquo; COMPETENCE&sbquo; COMPETENCE_EXC&sbquo; COMPTE&sbquo; COMPTE_EXC&sbquo; LNKBCC&sbquo; LNKBCC_EXC&sbquo; LNKBCI&sbquo; LNKBCI_EXC&sbquo; LNKBPC&sbquo; LNKBPC_EXC&sbquo; LNKBS&sbquo; LNKBS_EXC&sbquo; LNKCBRR&sbquo; LNKCBRR_EXC&sbquo; LNKCS&sbquo; LNKCS_EXC&sbquo; MAP_CP_BUREAU&sbquo; PAYS&sbquo; PAYS_EXC&sbquo; PROVINCE&sbquo; PROVINCE_EXC&sbquo; RANGE_RUE&sbquo; RANGE_RUE_EXC&sbquo; REGION&sbquo; REGION_EXC&sbquo; RUE&sbquo; RUE_EXC&sbquo; SERVICE&sbquo; SERVICE_EXC&sbquo; TPCODEX&sbquo; TPCODEX_EXC&sbquo; TPCOMPTE&sbquo; TPCOMPTE_EXC&sbquo; </font></td></tr>
<tr><td width="20%" valign=top></td><td width="15%" valign=top>RDC</td><td width="70%" valign=top><font size="2">DBREL_SCHEMA_RDCPROJECT&sbquo; DBVERSION&sbquo; </font></td></tr>
</table>
1
  • Better use a real parser if possible. Commented Sep 11, 2009 at 9:39

2 Answers

2

Check out JTidy. It will parse the HTML and give you a DOM interface to iterate over.

I would strongly recommend against using a regexp for anything but the simplest cases. HTML isn't a regular language and has no end of edge cases to trip you up.
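
To illustrate the parser-based approach, here is a minimal sketch. It does not use JTidy; it swaps in Python's standard-library `html.parser` so that it runs without extra dependencies. The names `NameExtractor`, `extract`, and `SAMPLE` are made up for this example, and `SAMPLE` is an abbreviated copy of the table from the question. It assumes the wanted names always sit inside a `<font>` tag that has a `color` attribute, and that the "(PRODUCTION, 50733)" part always sits in the `colspan` cell. A small regexp is used only on the already-extracted text of that one cell, which is the kind of simple case where it is safe.

```python
import re
from html.parser import HTMLParser

# Abbreviated copy of the sample from the question (mis-nested tags kept as-is).
SAMPLE = """<table width="95%" border="1">
<tr><td colspan=3><a name="algp1"></a><img src="menu/db2inst.jpg">  <font color="#FF0000" size="+1">algp1</font> (PRODUCTION, 50733)</td></tr>
<tr><td width="20%" valign=top><a name="GENAPPLP"></a><img src="menu/db2db.jpg"><font color="#00CC00"><b> GEN_APPL</font></b><br>(GENAPPLP)</td><td width="15%" valign=top>PARK</td><td width="70%" valign=top><font size="2">BOOKINGCARPARKING&sbquo; CUSTOMERS&sbquo; </font></td></tr>
<tr><td width="20%" valign=top><a name="KANTOORP"></a><img src="menu/db2db.jpg"><font color="#00CC00"><b> KANTOOR</font></b><br>(KANTOORP)</td><td width="15%" valign=top>CDDB</td><td width="70%" valign=top><font size="2">BATIMENTS&sbquo; </font></td></tr>
</table>"""


class NameExtractor(HTMLParser):
    """Collects text in colored <font> tags and the text of the colspan cell."""

    def __init__(self):
        super().__init__()
        self.in_colored_font = False
        self.in_header_cell = False
        self.names = []        # text found inside <font color=...>
        self.header_text = []  # remaining text of the <td colspan=...> cell

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "font" and "color" in attrs:
            self.in_colored_font = True
        elif tag == "td" and "colspan" in attrs:
            self.in_header_cell = True

    def handle_endtag(self, tag):
        if tag == "font":
            self.in_colored_font = False
        elif tag == "td":
            self.in_header_cell = False

    def handle_data(self, data):
        if self.in_colored_font and data.strip():
            self.names.append(data.strip())
        elif self.in_header_cell:
            self.header_text.append(data)


def extract(html):
    parser = NameExtractor()
    parser.feed(html)
    parser.close()
    # Pull "PRODUCTION" and "50733" out of the plain text "(PRODUCTION, 50733)".
    match = re.search(r"\((\w+),\s*(\d+)\)", "".join(parser.header_text))
    details = list(match.groups()) if match else []
    # First name is the instance; the details follow it, then the other names.
    return parser.names[:1] + details + parser.names[1:]


if __name__ == "__main__":
    print(extract(SAMPLE))
```

The same idea carries over to JTidy: let the parser cope with the sloppy markup, then pick values out of the resulting structure by tag and attribute rather than by matching the raw HTML string.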

2 Comments

+1. Avoid regex here: use it and you'll have two problems. And please search a little before posting; this question has been asked many times already, e.g. stackoverflow.com/questions/299942/… and stackoverflow.com/questions/181095/… and so on.
stackoverflow.com/questions/26638/… lists some libraries that do the tidying and parsing. But your answer pointed me in the right direction. Thank you.
0

Take a look at Regulazy.

It lets you create a regexp from an input string through a simple point-and-click interface.

http://osherove.com/tools/
