
Possible Duplicates:
(PHP5) Extracting a title tag and RSS feed address from HTML using PHP DOM or Regex
Grabbing title of a website using DOM

I am trying to run through a hundred different HTML files on my server and extract their titles for use in another PHP file.

For reference:

    <title>Generic Test Page</title>

What I need is a function that will return the string "Generic Test Page" and stick that into a global variable.

What I am doing right now is simply reading the file into an array called $lines. Foreach $lines as $line, I am testing for the string <title> ... but how do I extract only what's between the > and </title>?

My trouble is that sometimes the original developer decided to elaborate on the title: <title name=title class=title1>, or he put it on three lines instead of one. What in the world? So I can't just strip the first seven characters and the last eight characters. Which would be so nice...

Thank you!!


3 Answers

4

You need to use something like PHP Simple HTML DOM Parser:

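    // Assumes simple_html_dom.php (PHP Simple HTML DOM Parser) is on the include path.
    require_once 'simple_html_dom.php';

    function get_page_title($html_file) {
      $html = file_get_html($html_file);
      if (!$html) return null;          // file could not be read or parsed
      $node = $html->find('title', 0);  // first <title> element, or null
      return $node ? trim($node->plaintext) : null;
    }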

4 Comments

Suggested third party alternatives to SimpleHtmlDom that actually use DOM instead of String Parsing: phpQuery, Zend_Dom, QueryPath and FluentDom.
Awesome! I wasn't aware of all of those alternatives, so that gives me something to compare against what I use currently.
You're welcome. Also see the related link given below the question.
I figured out the DOM coding. Thank you so much for your help!
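For anyone who lands here later, PHP's built-in DOM extension can also do this without a third-party library. Below is a minimal sketch, not the asker's actual solution: the function name dom_page_title is made up for illustration, and it assumes the DOM and libxml extensions are enabled. It handles both the extra attributes and the three-line title from the question.

```php
<?php
// Hypothetical helper (name invented for this example): returns the text of
// the first <title> element in an HTML string, or null if there is none.
function dom_page_title(string $html): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Old pages are often invalid HTML; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = $doc->getElementsByTagName('title');
    if ($nodes->length === 0) {
        return null; // no <title> element found
    }

    // Collapse line breaks and runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
}

$sample = "<html><head><title name=title class=title1>\nGeneric Test Page\n</title></head><body></body></html>";
echo dom_page_title($sample), "\n";
```

To read from disk, pass file_get_contents($path) to the function after checking that it did not return false.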
2

Here $line is each line of the file:

    $pattern = '/<title[^>]*>(.*?)<\/title>/is';
    if (preg_match($pattern, $line, $match)) {
        return trim($match[1]); // your title!
    }

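or just use the pattern on the whole HTML (for example the result of file_get_contents) and return the match. Because of the s modifier, . also matches newlines, so this works even when the title is spread over several lines.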

or use something scurker has suggested.
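The "whole html" variant could look roughly like the sketch below. The function names regex_page_title and collect_titles are invented for this example, and the directory in the usage comment is a placeholder. It assumes every page sits in one directory with an .html extension.

```php
<?php
// Hypothetical helper: returns the title of one HTML file, or null if the
// file cannot be read or contains no <title> element.
function regex_page_title(string $path): ?string
{
    $html = is_readable($path) ? file_get_contents($path) : false;
    if ($html === false) {
        return null;
    }

    // Same pattern as above, applied to the whole file instead of one line.
    $pattern = '/<title[^>]*>(.*?)<\/title>/is';
    if (!preg_match($pattern, $html, $match)) {
        return null;
    }

    // Collapse line breaks and repeated spaces inside the title.
    return trim(preg_replace('/\s+/', ' ', $match[1]));
}

// Hypothetical helper: maps each *.html file name in a directory to its title.
function collect_titles(string $dir): array
{
    $titles = array();
    foreach (glob(rtrim($dir, '/') . '/*.html') ?: array() as $file) {
        $titles[basename($file)] = regex_page_title($file);
    }
    return $titles;
}

// Usage (the directory is a placeholder):
// $titles = collect_titles('/path/to/pages');
```

Returning an array keyed by file name avoids the global variable from the question; the calling script can keep the result wherever it needs it.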

6 Comments

Please will you tell me what all the slashes and stars and parentheses mean? and do you need to define $match as an array, or is it automatically an array when it's stuck in as an argument?
Sure: * means zero or more; / is the delimiter that opens and closes the pattern, so you put \ in front to match a literal slash (\/); the parentheses capture whatever matches inside them, which ends up in $match[1]; [^>]* matches any run of characters that are not >, so in <title [sdgsdsdg sd..sdgsdgsd]> the "sdgsdsdg sd..sdgsdgsd" part gets skipped. Check out some tutorials: regular-expressions.info/tutorial.html
$match is just the name I gave to the array that stores the "matches". You can name it whatever you want in the function call: preg_match($pattern, $source, $ARRAY_WITH_RESULTS); PHP creates the array for you, but it is always good to define it beforehand ($match = array()). See nl3.php.net/manual/en/function.preg-match.php
This worked when the title was all on one line, but when the developer split it into three lines, the code broke. I think I just figured out why not to use regex. :-) Thank you for giving me a starting point and for telling me what everything means. You're very helpful.
Could you provide an example? Regex will do, but the pattern probably has to change.
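To illustrate the last few comments, here is a small sketch with made-up sample data standing in for the three-line title. It suggests the pattern itself can stay the same: the per-line test fails because no single line holds both the opening and the closing tag, while the same pattern matches once the lines are joined. It also shows what ends up in $match.

```php
<?php
$pattern = '/<title[^>]*>(.*?)<\/title>/is';

// Sample data: a title split over three lines, as file() would return it.
$lines = array("<title name=title class=title1>\n", "Generic Test Page\n", "</title>\n");

// Line by line: count how many individual lines match the pattern.
$perLine = 0;
foreach ($lines as $line) {
    $perLine += preg_match($pattern, $line);
}

// Whole text: the s modifier lets "." cross the line breaks.
$match = array(); // optional; preg_match() fills it either way
$found = preg_match($pattern, implode('', $lines), $match);

echo "matches line by line: $perLine\n";
echo "matches on whole text: $found\n";
// $match[0] is the entire matched text, $match[1] the first (...) group.
echo 'title: ', trim($match[1]), "\n";
```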
0

You should use a regular expression to extract the inner part. More info here
