
I have a web bot that extracts data from a website. The problem is that the HTML is sent without line breaks, which makes it harder to match certain things, so I need to extract everything between the td tags. Here's an example string:

<a class="a" href="javascript:ow(19623507)">**-**-**-***.cstel.net</a>&nbsp; (<b><font color="#3300cc">Used</font></b>)</td><td><a class="a" href="javascript:ow(19623507)">**-**-**-***.cstel.net</a>&nbsp; (<b><font color="#3300cc">Used</font></b>)</td>

And my regex so far:

<a\s+class="a"\s+href="javascript:ow\((.*?)\)">.+</a>(?!<td>).+</td>

But my regex matches the whole line instead of matching the contents of each cell separately. Any ideas?


3 Answers

2

Don't waste your time on regexes. Use DOM and XPath.

 $dom = new DOMDocument(); $dom->loadHTML($html); // loadHTML() is an instance method and returns a bool
 $links = $dom->getElementsByTagName('a');
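
To make the DOM and XPath suggestion concrete, here is a minimal sketch. It assumes the question's string sits inside a complete table row (the wrapping `<table><tr><td>` is my addition), and the names `$rows`, `id`, `host` and `status` are purely illustrative.

```php
<?php
// Sample input: the question's two cells, wrapped in a table row (assumption).
$cell = '<td><a class="a" href="javascript:ow(19623507)">**-**-**-***.cstel.net</a>'
      . '&nbsp; (<b><font color="#3300cc">Used</font></b>)</td>';
$html = '<table><tr>' . $cell . $cell . '</tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings about old markup quiet
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$rows = [];
foreach ($xpath->query('//td') as $td) {
    // First link with class "a" inside this cell, if any.
    $link = $xpath->query('.//a[@class="a"]', $td)->item(0);
    if ($link === null) {
        continue;
    }
    // A small regex is still handy for the number inside ow(...).
    preg_match('/ow\((\d+)\)/', $link->getAttribute('href'), $m);
    $font = $xpath->query('.//font', $td)->item(0);
    $rows[] = [
        'id'     => $m[1] ?? null,
        'host'   => $link->textContent,
        'status' => $font !== null ? trim($font->textContent) : null,
    ];
}
print_r($rows);
```

Because the parser works on the tag structure, it does not matter whether the HTML arrives on one line or many.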

1

Have you tried changing the greedy .+ to the lazy .+? ?
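
A quick sketch of the difference, assuming PHP's preg functions and the string from the question. The `~` delimiters, the extra capture groups and the variable names are my own additions, and I dropped the `(?!<td>)` lookahead from the lazy version because it only inspects the characters right after `</a>`.

```php
<?php
// The example string from the question, unchanged.
$html = '<a class="a" href="javascript:ow(19623507)">**-**-**-***.cstel.net</a>&nbsp; (<b><font color="#3300cc">Used</font></b>)</td><td><a class="a" href="javascript:ow(19623507)">**-**-**-***.cstel.net</a>&nbsp; (<b><font color="#3300cc">Used</font></b>)</td>';

// Original pattern: the greedy .+ runs to the last </a> and the last </td>.
$greedy = '~<a\s+class="a"\s+href="javascript:ow\((.*?)\)">.+</a>(?!<td>).+</td>~';

// Lazy variant: each .+? stops at the first </a> or </td> it can reach.
$lazy = '~<a\s+class="a"\s+href="javascript:ow\((.*?)\)">(.+?)</a>(.+?)</td>~';

$greedyCount = preg_match_all($greedy, $html, $g);
$lazyCount   = preg_match_all($lazy, $html, $l, PREG_SET_ORDER);

echo "greedy matches: $greedyCount\n";
echo "lazy matches:   $lazyCount\n";
foreach ($l as $match) {
    // $match[1] is the id inside ow(...), $match[2] the link text.
    echo $match[1], ' ', $match[2], "\n";
}
```

The greedy pattern swallows both cells as one match, while the lazy one yields one match per cell.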


0

Can you determine where the proper line breaks SHOULD be? If so, it might be easier to first replace those tokens with a proper line break and then use the pattern you have (assuming that pattern works - I haven't tried it).

Your pattern looks VERY specific, but perhaps it works fine for what you are doing.
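
For what it's worth, a rough sketch of that idea, assuming `</td>` is the token that should end each line (that choice and the variable names are mine). Since `.` does not match a newline by default, the original pattern can then no longer run across cells.

```php
<?php
// The example string from the question, unchanged.
$html = '<a class="a" href="javascript:ow(19623507)">**-**-**-***.cstel.net</a>&nbsp; (<b><font color="#3300cc">Used</font></b>)</td><td><a class="a" href="javascript:ow(19623507)">**-**-**-***.cstel.net</a>&nbsp; (<b><font color="#3300cc">Used</font></b>)</td>';

// Put a line break after every closing td tag (assumed to be the right token).
$withBreaks = str_replace('</td>', "</td>\n", $html);

// The pattern from the question, wrapped in ~ delimiters and otherwise unchanged.
$pattern = '~<a\s+class="a"\s+href="javascript:ow\((.*?)\)">.+</a>(?!<td>).+</td>~';

$count = preg_match_all($pattern, $withBreaks, $m);

echo "matches: $count\n";
print_r($m[1]);   // the ids captured from ow(...)
```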
