I'm trying to parse the text of tweets in PHP 5.3, but I'm having trouble with user mentions, hashtags and links that contain Unicode characters.
First I fetch the tweets and store them in a text file:
$tweets_file = createFile('cache/'.$twitteruser.'-tweets.txt', json_encode($tweets));
After that, the text file contains a bunch of Unicode escape sequences (e.g. Landsli\u00f0sma\u00f0ur).
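As far as I understand, these escapes are just how json_encode() writes non-ASCII characters by default, and json_decode() should turn them back into UTF-8. Here is a minimal sketch of that round trip. The $sample array is made up for illustration and stands in for the real API response, and the script file is assumed to be saved as UTF-8:

```php
<?php
// Hypothetical sample data; the real $tweets come from the Twitter API.
$sample = array(array('text' => 'Landsliðsmaður #rafhlaða'));

// json_encode() serialises non-ASCII characters as \uXXXX escapes by default.
$json = json_encode($sample);

// json_decode() converts the escapes back into UTF-8 encoded strings.
$decoded = json_decode($json);
$text = $decoded[0]->text;
```

So the text that reaches twitterify() should already be plain UTF-8, which makes me think the problem is in the regular expressions rather than in the cache file.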
To display the tweets, I read the file back and run each tweet's text through a function that adds the links:
function twitterify($text) {
    // Link URLs that start with a scheme (http://, https://, ...)
    $text = preg_replace("#(^|[\n ])([\w]+?://[\w]+[^ \"\n\r\t< ]*)#u", "\\1<a href=\"\\2\" target=\"_blank\">\\2</a>", $text);
    // Link bare www. and ftp. addresses
    $text = preg_replace("#(^|[\n ])((www|ftp)\.[^ \"\t\n\r< ]*)#u", "\\1<a href=\"http://\\2\" target=\"_blank\">\\2</a>", $text);
    // Link @mentions
    $text = preg_replace("/@(\w+)/u", "<a href=\"http://www.twitter.com/\\1\" target=\"_blank\">@\\1</a>", $text);
    // Link #hashtags
    $text = preg_replace("/#(\w+)/u", "<a href=\"http://search.twitter.com/search?q=\\1\" target=\"_blank\">#\\1</a>", $text);
    return $text;
}
$tweets_file = file_get_contents('cache/'.$queried_user.'-tweets.txt');
$tweets = json_decode($tweets_file);
foreach ($tweets as $tweet) {
    echo twitterify($tweet->text);
    // do other stuff...
}
Everything works fine until a Unicode character appears in a hashtag, for example. The preg_replace match stops at that character, so a hashtag like #rafhlaða is rendered as <a href="#">#rafhla</a>ða.
What can I do to render text with Unicode characters in it properly?