Extracting string between <title> and </title> using PHP [duplicate]

Question

Possible Duplicates:
(PHP5) Extracting a title tag and RSS feed address from HTML using PHP DOM or Regex
Grabbing title of a website using DOM

I am trying to run through a hundred different html files on my server, and extract the titles for use in another php file.

For reference:

    <title>Generic Test Page</title>

What I need is a function that will return the string "Generic Test Page" and stick that into a global variable.

What I am doing right now is simply reading the file into an array called $lines. For each $line in $lines, I test for the string <title> ... but how do I extract only what's between the > and the </title>?

My trouble is that sometimes the original developer decided to elaborate on the title: <title name=title class=title1>, or he put it on three lines instead of one. What in the world? So I can't just strip the first seven characters and the last eight characters. Which would be so nice...

Thank you!!

A solution can be found here: stackoverflow.com/questions/3054347/…
– Jason McCreary, Commented May 10, 2011 at 18:55
I would gladly use preg_match or preg_split, but I can't figure out where all the extra characters came from. For example, why doesn't preg_split(">", $line) return an array with two parts, the first before the > and the other after the >? It keeps telling me that it can't find the delimiter. Ugh...
– EllaJo, Commented May 10, 2011 at 19:34
Okay, apparently I'm not supposed to do that. I see lots of complaints, but why is it bad?
– EllaJo, Commented May 10, 2011 at 19:43
Have you not tried searching for an answer? This issue has been addressed several times: stackoverflow.com/questions/138313/… stackoverflow.com/questions/2988055/… stackoverflow.com/questions/3195851/how-to-extract-a-page-title
– Strong Like Bull, Commented May 10, 2011 at 19:57
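
On the preg_split question in the comments above: the preg_* functions treat the first character of the pattern as a delimiter, so a bare ">" opens a pattern that never gets closed. A minimal sketch of the difference, using a made-up $line value:

```php
<?php
$line = '<title>Generic Test Page</title>';

// preg_split() expects a delimited pattern such as '/>/' rather than '>'.
$parts = preg_split('/>/', $line);
// $parts is ['<title', 'Generic Test Page</title', '']

// For a fixed string no regex is needed; explode() splits on it directly.
$same = explode('>', $line);
```

Note that there are three parts rather than two, because the sample line contains two > characters and the last one ends the string.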

scurker · Accepted Answer · 2011-05-10 18:57:19Z

4

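You need to use something like PHP Simple HTML DOM Parser:

    require_once 'simple_html_dom.php'; // PHP Simple HTML DOM Parser

    function get_page_title($html_file) {
        $html = file_get_html($html_file);    // false if the file cannot be loaded
        if (!$html) return null;
        $title = $html->find('title', 0);      // first <title> element, or null
        return $title ? trim($title->plaintext) : null;
    }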

answered May 10, 2011 at 18:57

scurker


4 Comments

Gordon Over a year ago

Suggested third party alternatives to SimpleHtmlDom that actually use DOM instead of String Parsing: phpQuery, Zend_Dom, QueryPath and FluentDom.

scurker Over a year ago

Awesome! I wasn't aware of all of those alternatives, so that gives me something to compare against what I use currently.

Gordon Over a year ago

You're welcome. Also see the related link given below the question.

EllaJo Over a year ago

I figured out the DOM coding. Thank you so much for your help!
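
For readers who want a DOM-based version without a third-party library, here is a rough sketch using PHP's bundled DOMDocument class. This is not the code from the answer or from the libraries Gordon lists; the function name dom_page_title is made up for illustration, and collapsing whitespace in the result is my own assumption about what is wanted for titles split over several lines.

```php
<?php
// Hypothetical helper: returns the <title> text of an HTML string, or null.
function dom_page_title(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    // Real-world HTML is often invalid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }
    $nodes = $doc->getElementsByTagName('title');
    if ($nodes->length === 0) {
        return null;
    }
    // Collapse line breaks and runs of whitespace inside the title.
    return trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
}
```

Because the parser handles the tag itself, extra attributes on <title> or a title spread over three lines should not need special treatment. For a file on disk, pass in the result of file_get_contents().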

Paolo_Mulder · Accepted Answer · 2011-05-10 19:33:04Z

2

Here $line is each line of the file:

    $pattern = '/<title[^>]*>(.*?)<\/title>/is';
    if (preg_match($pattern, $line, $match)) {
        return trim($match[1]); // your title!
    }

Or just use the pattern on the whole HTML and return the match.

Or use something like what scurker has suggested.
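
A sketch of the "whole HTML" variant, applied to a folder of files as in the question. The function names title_from_file and titles_in_dir are hypothetical, and the *.html glob is an assumption about how the files are named.

```php
<?php
// Hypothetical helper: returns the <title> text of one HTML file, or null.
function title_from_file(string $path): ?string
{
    $html = file_get_contents($path); // the whole file as one string
    if ($html === false) {
        return null;
    }
    $pattern = '/<title[^>]*>(.*?)<\/title>/is';
    return preg_match($pattern, $html, $match) ? trim($match[1]) : null;
}

// Hypothetical helper: maps each *.html file name in $dir to its title.
function titles_in_dir(string $dir): array
{
    $titles = [];
    foreach (glob($dir . '/*.html') ?: [] as $path) {
        $titles[basename($path)] = title_from_file($path);
    }
    return $titles;
}
```

Returning an array keyed by file name is a design choice on my part; it avoids the global variable mentioned in the question while still letting another PHP file look the titles up.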

answered May 10, 2011 at 19:33

Paolo_Mulder


6 Comments

EllaJo Over a year ago

Please will you tell me what all the slashes and stars and parentheses mean? and do you need to define $match as an array, or is it automatically an array when it's stuck in as an argument?

Paolo_Mulder Over a year ago

Sure: * means zero or more. / is the delimiter that marks the start and end of the expression, so you put \ in front of a literal slash ( \/ ). [^>]* means match any characters which are not >, so in <title [sdgsdsdg sd..sdgsdgsd]> the "sdgsdsdg sd..sdgsdgsd" part gets skipped. The parentheses capture the part you want, which ends up in $match[1]. Check out some tutorials: regular-expressions.info/tutorial.html

Paolo_Mulder Over a year ago

$match is just the name I gave to the array that stores the "matches". You can name it whatever you want in the function: preg_match($pattern, $source, $array_with_results). It is always good to define the array beforehand ( $match = array() ). See nl3.php.net/manual/en/function.preg-match.php
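
The pieces of the pattern discussed in these comments can also be written one per line. This is only an annotated restatement of the same pattern; the ~ delimiter and the x flag are my additions for readability, and the sample input string is made up.

```php
<?php
// Same pattern as in the answer, written with the x flag so that whitespace
// and #-comments inside the pattern are ignored. "~" is used as the delimiter
// here (an illustrative choice) so the slash in </title> needs no backslash.
$annotated = '~
    <title     # the literal text "<title"
    [^>]*      # zero or more characters that are not ">" (any attributes)
    >          # the end of the opening tag
    (.*?)      # group 1: any characters, as few as possible (the title text)
    </title>   # the literal closing tag
~isx';         // i: ignore case, s: "." also matches newlines, x: extended layout

// preg_match() fills $match by reference; it does not have to exist beforehand.
$found = preg_match($annotated, '<title name=title class=title1>Generic Test Page</title>', $match);

// $match[0] is the whole matched text, $match[1] is what group 1 captured.
$title = $found ? trim($match[1]) : null;
```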

EllaJo Over a year ago

This worked when the title was all on one line, but when the developer split it into three lines, the code broke. I think I just figured out why not to use regex. :-) Thank you for giving me a starting point and for telling me what everything means. You're very helpful.

Paolo_Mulder Over a year ago

Could you provide an example? Regex will do, but the pattern probably has to change.
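
One likely explanation for the three-line failure, sketched with a made-up sample: when the pattern is tested against one line at a time, no single line contains both the opening and the closing tag, so nothing matches. Running the unchanged pattern against the whole content does match, since the s flag lets . cross line breaks.

```php
<?php
$pattern = '/<title[^>]*>(.*?)<\/title>/is';

// Made-up sample with the title spread over three lines.
$html = "<html><head>\n<title name=title class=title1>\nGeneric Test Page\n</title>\n</head></html>";

// Line by line: no single line holds both <title ...> and </title>.
$perLine = null;
foreach (explode("\n", $html) as $line) {
    if (preg_match($pattern, $line, $match)) {
        $perLine = trim($match[1]);
    }
}

// Whole string: the s flag lets "." cross the line breaks.
$whole = preg_match($pattern, $html, $match) ? trim($match[1]) : null;

var_dump($perLine); // NULL
var_dump($whole);   // string(17) "Generic Test Page"
```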


Brian Geihsler · Accepted Answer · 2011-05-10 18:57:45Z

0

You should use a regular expression to extract the inner part. More info here

answered May 10, 2011 at 18:57

Brian Geihsler

