
I'm trying to extract a value from a multiline pattern with PHP and preg_match. This is the pattern I'm searching for within the string I'm passing to preg_match($regex, $string, $the_match):

Latitude:</td>
        <td class="formCell">
        40-45-40.205 N
       </tr>

I know that if it were all on one line like so:

Latitude:</td><td class="formCell">40-45-40.205 N</tr>

Then the following would be valid and it would properly extract the value:

/Latitude:<\/td><td class="formCell">(.*?)<\/tr>/

However, since the pattern I'm looking for spans multiple lines, the above regex doesn't work. I'm getting the initial string I'm passing to preg_match() via file_get_contents($url), so I'm at the mercy of the remote content to some extent. Any help would be much appreciated!

  • Full answer: /Latitude:<\/td>[\s]*<td class="formCell">[\s]*([\s\S]*?)[\s]*<\/tr>/ Commented Jul 5, 2012 at 23:33

3 Answers


Use [\s\S] instead of . (the dot).

/Latitude:<\/td>[\s]*<td class="formCell">([\s\S]*?)<\/tr>/

. is a wildcard, but by default it does not match line-break characters. [\s\S] simply says "match all space and non-space characters" (i.e. anything at all, line breaks included).

Note I also allowed for optional space characters after </td>.

(Sidenote: the HTML is invalid - closing a table row before closing the table cell.)
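A minimal sketch of how this pattern could be used with preg_match. The sample string is an assumption that mirrors the markup from the question; the real page may differ. Because [\s\S]*? also captures the line breaks and indentation around the value, the sketch runs the result through trim().

```php
<?php
// Assumed sample input, copied from the markup shown in the question.
$html = "Latitude:</td>\n        <td class=\"formCell\">\n        40-45-40.205 N\n       </tr>";

// [\s\S] matches any character, line breaks included.
// \s* allows optional whitespace between </td> and <td ...>.
$regex = '/Latitude:<\/td>\s*<td class="formCell">([\s\S]*?)<\/tr>/';

$latitude = null;
if (preg_match($regex, $html, $the_match)) {
    // Group 1 still holds the surrounding whitespace, so strip it.
    $latitude = trim($the_match[1]);
}

echo $latitude ?? 'Latitude not found', "\n"; // prints: 40-45-40.205 N
```

Trimming in PHP is an alternative to adding more \s* around the capture group; either approach leaves only the value.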


4 Comments

Winner winner chicken dinner!! (almost) I had to add another "[\s]*" after the <td class="formCell">. Like the following: /Latitude:<\/td>[\s]*<td class="formCell">[\s]*([\s\S]*?)<\/tr>/
Hmm... this should not be necessary because any space characters there would be picked up by the sub-group, which accepts space characters. It worked for me as it was.
Update: I also had to add [\s]* right after where the value is being extracted. /Latitude:<\/td>[\s]*<td class="formCell">[\s]*([\s\S]*?)[\s]*<\/tr>/
If I didn't add those [\s]* before and after where I'm extracting the value then the line breaks get passed in with the value being extracted. You can test it real quick over at Rubular.

There is no simple flag for this. A simple hack could be:

Latitude:(.*?)<\/td>(.*?)<td class="formCell">(.*?)<\/tr>

Then add the dotall flag (s) after the closing delimiter of your regex, so that a '.' [dot] also matches newlines. Be aware that it could then match a lot more than you intend. Is it your own code, or are you scraping HTML from a third-party website? Maybe you are using regexes when you don't have to!
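A rough sketch of the effect of the s modifier (PCRE_DOTALL) on this pattern. The sample string is assumed from the question, and the delimiters are added here because the pattern above is shown without them. Note that with this pattern the value lands in the third capture group, not the first.

```php
<?php
// Assumed sample input, copied from the markup shown in the question.
$html = "Latitude:</td>\n        <td class=\"formCell\">\n        40-45-40.205 N\n       </tr>";

// The pattern from this answer, without delimiters or modifiers.
$pattern = 'Latitude:(.*?)<\/td>(.*?)<td class="formCell">(.*?)<\/tr>';

// Without a modifier, . stops at line breaks, so nothing matches.
$plain = preg_match('/' . $pattern . '/', $html, $m);

// With the s modifier, . also matches line breaks.
$dotall = preg_match('/' . $pattern . '/s', $html, $m);

// Groups 1 and 2 only soak up filler; the value is in group 3.
$latitude = $dotall ? trim($m[3]) : null;

var_dump($plain, $dotall, $latitude);
```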

1 Comment

I tried it and it didn't work. @Utkanos, for the most part, was spot on.

I think the trick is to "sprinkle" [\s]* anywhere the HTML format would legally allow whitespace. You do not need special flags or anything.

Latitude:[\s]*<\/td>[\s]*<td[\s]*class="formCell">[\s]*([\s\S]*?)[\s]*<\/tr>

Keep in mind that html is VERY forgiving about whitespace. You need to evaluate your input and decide what is acceptable tolerance for you.

Another caveat is that these elements may have different attributes, or different quote styles... If you must handle that as well, you will need to use more of . (the dot) together with the "ungreedy" flag (add U after the closing delimiter when passing the pattern to the preg functions); and then perhaps some fancy back-referencing once you realize that > can legally occur inside of an attribute ;-)
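One way the more tolerant variant might look, as a sketch only. The helper name extract_latitude and both sample strings are hypothetical, not part of this answer. The pattern uses the U modifier (PCRE_UNGREEDY), which makes every quantifier lazy by default, and a back-reference so the closing quote must match the opening one. It still assumes that > does not occur inside an attribute value.

```php
<?php
// Hypothetical helper for illustration: returns the latitude text, or null if absent.
function extract_latitude(string $html): ?string
{
    // s: . also matches line breaks. U: quantifiers are lazy by default.
    // (["\']) captures the opening quote; \1 requires the same closing quote.
    // (\S.*) starts at the first non-whitespace character, so no trim() is needed.
    $regex = '/Latitude:\s*<\/td>\s*<td\b[^>]*\bclass=(["\'])formCell\1[^>]*>\s*(\S.*)\s*<\/tr>/sU';

    return preg_match($regex, $html, $m) ? $m[2] : null;
}

// Assumed inputs: the markup from the question, and a variant with an
// extra attribute, single quotes, and different whitespace.
$original = "Latitude:</td>\n        <td class=\"formCell\">\n        40-45-40.205 N\n       </tr>";
$variant  = "Latitude: </td>\n<td id=\"lat\" class='formCell'>\n  40-45-40.205 N\n</tr>";

var_dump(extract_latitude($original), extract_latitude($variant));
```

How much tolerance to build in remains a judgment call; each extra allowance also widens what the pattern can match by accident.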

2 Comments

During my initial testing I was using \s without the square brackets and asterisk. I was using Rubular for testing, and in the quick reference section they never mentioned wrapping the \s in square brackets. All the other places I searched for help never mentioned the extra characters either.
I suppose they aren't strictly necessary (it creates a "class" of characters, but with only one member it's perhaps overkill). The asterisk is the critical part of my suggestion, and the plentiful placement.
