I'm working on a script based on "Simple HTML DOM", and I want to detect a string's charset after getting the inner text from a URL so that I can convert it to UTF-8 with iconv().
I've tried a lot of things, but none of them work with Windows-1256.
What I've tried:

mb_detect_encoding($content) detects Windows-1256 as UTF-8
mb_detect_encoding($content, "windows-1256") gives an "Illegal argument" error

function is_utf8($string) {   
  return preg_match('%^(?:  
  [\x09\x0A\x0D\x20-\x7E] # ASCII  
  | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte  
  | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs  
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte  
  | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates  
  | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3  
  | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15  
  | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16  
  )*$%xs', $string);
}

This function returns "0" if the string is not UTF-8, but when the string is UTF-8 the result is "Page can not be found". I'm not sure why!
My code is:

$html = file_get_html($url);
foreach($html->find('div[id=content]') as $element) {
  $content = $element->innertext;
  #Detect charset encoding of $content
}

URLs I'm working with:
UTF-8: http://www.masrawy.com/news/Egypt/Politics/2013/March/3/5541050.aspx
Windows-1256: http://www.youm7.com//News.asp?NewsID=965545

2 Answers

Have you tried using this?

function is_utf8($string) {
  return (mb_detect_encoding($string, 'UTF-8', true) == 'UTF-8');
}

This works for me on the URLs you're specifying.
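
For comparison, a minimal sketch of the same check written with mb_check_encoding(), which answers the yes/no question directly. The helper name looks_like_utf8 and the sample byte strings are assumptions made for illustration (they are not taken from the pages above), and the type declarations assume PHP 7 or later with the mbstring extension loaded.

```php
<?php
// Hypothetical helper: true when $bytes is a valid UTF-8 byte sequence.
// Requires the mbstring extension.
function looks_like_utf8(string $bytes): bool {
    return mb_check_encoding($bytes, 'UTF-8');
}

// Assumed sample data: the same Arabic word encoded two ways.
$utf8Sample   = "\xD9\x85\xD8\xB1\xD8\xAD\xD8\xA8\xD8\xA7"; // UTF-8 bytes
$cp1256Sample = "\xE3\xD1\xCD\xC8\xC7";                     // Windows-1256 bytes

var_dump(looks_like_utf8($utf8Sample));   // bool(true)
var_dump(looks_like_utf8($cp1256Sample)); // bool(false)
```

The Windows-1256 sample fails because 0xE3 announces a three-byte UTF-8 sequence but the next byte is not a continuation byte. Note that this only tells you the string is not UTF-8; it does not identify Windows-1256 as such.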

Also, I had the masrawy.com site CONSTANTLY fail to load (perhaps why you might be seeing "Page can not be found") while testing a few different options...

Oddly enough, using a regex like yours made PHP crash completely on my Windows install, taking Apache down with it.
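
If a regex-based check is still preferred, a much smaller pattern may be enough. The sketch below relies on the /u modifier, which makes PCRE validate the subject as UTF-8 before matching; preg_match() then returns false for invalid input. The helper name is_utf8_pcre is hypothetical. Whether the large pattern's crash comes from its repeated alternation exhausting the stack on long pages is only a guess, not something confirmed in this thread.

```php
<?php
// Hypothetical helper: validate UTF-8 through PCRE's own check.
function is_utf8_pcre(string $bytes): bool {
    // The empty pattern matches any valid subject, so the result is 1.
    // For a subject that is not valid UTF-8, preg_match() returns false.
    return preg_match('//u', $bytes) === 1;
}

// Assumed sample data, not taken from the pages in the question.
var_dump(is_utf8_pcre("\xD9\x85\xD8\xB1\xD8\xAD\xD8\xA8\xD8\xA7")); // bool(true)
var_dump(is_utf8_pcre("\xE3\xD1\xCD\xC8\xC7"));                     // bool(false)
```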

1 Comment

It looks like the asker never said whether this worked, so I tested it myself and it does. I'm in the same situation: my pages are either Windows-1256 or UTF-8. This is how I call your function together with a separate conversion function, utf8(): if (!is_utf8($t2)) echo $t2 = utf8($t2)."<br/>"; else echo $t2."<br/>";

This is the whole function, based on Mark's answer and the function I used before:

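function utf8($utf8) {
  // Already valid UTF-8: return the string unchanged.
  if (mb_detect_encoding($utf8, 'UTF-8', true) == 'UTF-8') {
    return $utf8;
  }
  // Otherwise assume Windows-1256 and convert it to UTF-8.
  return iconv("windows-1256", "utf-8", $utf8);
}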

To use it, just call the function and it will return the correct value:

utf8($text) 
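
A slightly more defensive variant of the same idea, as a sketch only: the fallback charset becomes a parameter and a failed iconv() call is reported instead of silently returning false. The name convert_to_utf8, its parameter, and the sample byte strings are assumptions for illustration; the code assumes PHP 7 or later with the mbstring and iconv extensions available.

```php
<?php
// Hypothetical helper: return $text as UTF-8, converting from
// $fallbackCharset when the input is not already valid UTF-8.
function convert_to_utf8(string $text, string $fallbackCharset = 'WINDOWS-1256'): string {
    // Strict detection limited to UTF-8, as in the answer above.
    if (mb_detect_encoding($text, 'UTF-8', true) === 'UTF-8') {
        return $text;
    }
    $converted = iconv($fallbackCharset, 'UTF-8', $text);
    if ($converted === false) {
        throw new RuntimeException("iconv could not convert from $fallbackCharset");
    }
    return $converted;
}

// Assumed sample data: the same Arabic word in UTF-8 and in Windows-1256.
$fragments = [
    "\xD9\x85\xD8\xB1\xD8\xAD\xD8\xA8\xD8\xA7",
    "\xE3\xD1\xCD\xC8\xC7",
];
foreach ($fragments as $fragment) {
    // Both lines print the same hex dump once converted.
    echo bin2hex(convert_to_utf8($fragment)), "\n";
}
```

In the question's loop, this would be applied to $element->innertext. Keep in mind that anything that is not valid UTF-8 is treated as Windows-1256 here, so a page in a third encoding would be converted incorrectly.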
