2

I'm trying to parse text from tweets in PHP 5.3, but I have a problem with parsing user mentions, hashtags and links which contain Unicode characters.

First I fetch the tweets and store them in a text file:

$tweets_file = createFile('cache/'.$twitteruser.'-tweets.txt', json_encode($tweets));

After that, the text file contains a bunch of escaped Unicode characters (e.g. Landsli\u00f0sma\u00f0ur).
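Those \u00f0 sequences are expected: by default json_encode() escapes every non-ASCII character, and json_decode() turns them back into UTF-8. A minimal sketch of that round trip, with a made-up one-element array standing in for the real tweet data:

```php
<?php
// Illustrative string only; the UTF-8 bytes are spelled out so the example
// does not depend on the encoding of this source file.
$text = "Landsli\xC3\xB0sma\xC3\xB0ur"; // "Landsliðsmaður"

// json_encode() escapes non-ASCII characters as \uXXXX by default.
$json = json_encode(array('text' => $text));
echo $json, "\n"; // {"text":"Landsli\u00f0sma\u00f0ur"}

// json_decode() restores the original UTF-8 string.
$decoded = json_decode($json);
var_dump($decoded->text === $text); // bool(true)
```

So the escaped form in the cache file is not itself the problem; after json_decode() the regexes see ordinary UTF-8 text.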

I display the tweets like this:

function twitterify($text) {
  $text = preg_replace("#(^|[\n ])([\w]+?://[\w]+[^ \"\n\r\t< ]*)#u", "\\1<a href=\"\\2\" target=\"_blank\">\\2</a>", $text);
  $text = preg_replace("#(^|[\n ])((www|ftp)\.[^ \"\t\n\r< ]*)#u", "\\1<a href=\"http://\\2\" target=\"_blank\">\\2</a>", $text);
  $text = preg_replace("/@(\w+)/u", "<a href=\"http://www.twitter.com/\\1\" target=\"_blank\">@\\1</a>", $text);
  $text = preg_replace("/#(\w+)/u", "<a href=\"http://search.twitter.com/search?q=\\1\" target=\"_blank\">#\\1</a>", $text);
  return $text;
}

$tweets_file = file_get_contents('cache/'.$queried_user.'-tweets.txt');
$tweets = json_decode($tweets_file);
foreach($tweets as $tweet) {
  echo twitterify($tweet->text);
  // do other stuff...
}

Everything works fine until a hashtag, for example, contains a non-ASCII character. My preg_replace stops at that character, and a hashtag like #rafhlaða renders as <a href="#">#rafhla</a>ða.

What can I do to render text containing Unicode characters properly?

  • Please post the contents of the file to pastebin and add the link to the question. (Commented Jul 20, 2013 at 20:45)
  • The contents of the file are here: pastebin.com/kzXqwwVT (Commented Jul 20, 2013 at 20:50)

2 Answers

1

I can't reproduce your error. I took the JSON data from pastebin and reduced it to the simplest case:

[{"text":"#rafhla\u00f0a"}]

So the text is a single hashtag: #rafhlaða

Then I ran the following script:

<?php
function twitterify($ret) {
    $ret = preg_replace("#(^|[\n ])([\w]+?://[\w]+[^ \"\n\r\t< ]*)#u", "\\1<a href=\"\\2\" target=\"_blank\">\\2</a>", $ret);
    $ret = preg_replace("#(^|[\n ])((www|ftp)\.[^ \"\t\n\r< ]*)#u", "\\1<a href=\"http://\\2\" target=\"_blank\">\\2</a>", $ret);
    $ret = preg_replace("/@(\w+)/u", "<a href=\"http://www.twitter.com/\\1\" target=\"_blank\">@\\1</a>", $ret);
    $ret = preg_replace("/#(\w+)/u", "<a href=\"http://search.twitter.com/search?q=\\1\" target=\"_blank\">#\\1</a>", $ret);
    return $ret;
}


$tweets_file = file_get_contents('file.txt');
$tweets = json_decode($tweets_file);
foreach($tweets as $tweet) {
    print $tweet->text;
    print "\n";
    echo twitterify($tweet->text);
    exit;
}

It printed:

#rafhlaða
<a href="http://search.twitter.com/search?q=rafhlaða" target="_blank">#rafhlaða</a>

Which contradicts your statement:

#rafhlaða renders to <a href="#">#rafhla</a>ða

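Update: this version drops the u modifier and, instead of \w+, matches everything up to the next whitespace or the end of the string: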

<?php
function twitterify($ret) {
    $ret = preg_replace("#(^|[\n ])([\w]+?://[\w]+[^ \"\n\r\t< ]*)#", "\\1<a href=\"\\2\" target=\"_blank\">\\2</a>", $ret);
    $ret = preg_replace("#(^|[\n ])((www|ftp)\.[^ \"\t\n\r< ]*)#", "\\1<a href=\"http://\\2\" target=\"_blank\">\\2</a>", $ret);
    $ret = preg_replace("/@(.+?)(?=\s|$)/", "<a href=\"http://www.twitter.com/\\1\" target=\"_blank\">@\\1</a>", $ret);
    $ret = preg_replace("/#(.+?)(?=\s|$)/", "<a href=\"http://search.twitter.com/search?q=\\1\" target=\"_blank\">#\\1</a>", $ret);
    return $ret;
}


$tweet = '[{"text":"#rafhla\u00f0a #rafhla\u00f0a"}]';
$tweet = json_decode($tweet);
print $tweet[0]->text;
print "\n";
echo twitterify($tweet[0]->text);

This prints:

#rafhlaða #rafhlaða

<a href="http://search.twitter.com/search?q=rafhlaða" target="_blank">#rafhlaða</a> <a href="http://search.twitter.com/search?q=rafhlaða" target="_blank">#rafhlaða</a>
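One caveat with matching up to the next whitespace: trailing punctuation becomes part of the link, so "#rafhlaða," would include the comma. Where PCRE is built with Unicode property support, a possible alternative is to name the allowed characters explicitly. The sketch below is only that: linkify_hashtags() is a hypothetical helper, and the rawurlencode()/htmlspecialchars() calls are additions that are not in the original code.

```php
<?php
// Hypothetical helper, not part of the original answer.
function linkify_hashtags($text) {
    // \p{L} matches any letter and \p{N} any digit. This needs the /u
    // modifier and a PCRE build with Unicode property support.
    return preg_replace_callback(
        '/#([\p{L}\p{N}_]+)/u',
        function ($m) {
            // Encode the tag for the query string and escape it for HTML.
            $query = rawurlencode($m[1]);
            $label = htmlspecialchars($m[1], ENT_QUOTES, 'UTF-8');
            return '<a href="http://search.twitter.com/search?q=' . $query
                 . '" target="_blank">#' . $label . '</a>';
        },
        $text
    );
}

// UTF-8 bytes written out so the file encoding does not matter.
$html = linkify_hashtags("#rafhla\xC3\xB0a, takk");
echo $html, "\n";
// <a href="http://search.twitter.com/search?q=rafhla%C3%B0a" target="_blank">#rafhlaða</a>, takk
```

Here the comma stays outside the link because it is neither a letter, a digit nor an underscore. On a build without Unicode property support the pattern will not compile, in which case the whitespace-based version above remains the fallback.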


9 Comments

Well, this is odd... I triple checked your snippet with mine but I still have this problem =/
@errata Maybe Unicode support differs between our PHP interpreters? Mine is PHP 5.4.7 (cli) (built: Sep 14 2012 14:44:02), running on Slackware Linux 14.0.
I even tried to reproduce your case. Still got the same problem... My PHP is v5.3.15 on Mac OS X 10.8.4.
@errata I tried to reproduce it online, but it works there: ideone.com/M1umyK
@errata I found the cause, in the manual entry for the u (PCRE_UTF8) modifier: "This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32" php.net/manual/en/reference.pcre.pattern.modifiers.php
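
For anyone comparing interpreters as in the comments above, a quick probe can show what the local PCRE build does with non-ASCII letters. This is a rough diagnostic sketch, not something from the thread, and the variable names are made up:

```php
<?php
// "\xC3\xB0" is the UTF-8 encoding of "ð".
$char = "\xC3\xB0";

// false means the pattern failed to compile, which suggests PCRE was built
// without UTF-8 or Unicode property support. @ hides the compile warning.
$props = @preg_match('/^\p{L}$/u', $char);

// 0 means the u modifier is accepted, but \w still covers ASCII only on
// this build, which would explain a hashtag being cut off at "ð".
$word = @preg_match('/^\w$/u', $char);

var_dump($props, $word);
```

If both calls return 1, the \w-based patterns from the question should handle #rafhlaða; if not, the version of twitterify() without the u modifier is the safer choice on that machine.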
0

Try adding this to your script (and leave out the preg_replace):

header('Content-Type: application/json; Charset=UTF-8');

Solution two:

$tweets_file = file_get_contents('cache/'.$queried_user.'-tweets.txt', FILE_TEXT);

3 Comments

Hm, but then my page does not render as HTML content but as plain text? Like I'm looking at the source of my file.
Then the question is, why are you storing JSON data in a txt file (cache/'.$queried_user.'-tweets.txt) and not a .json file?
I tried to save .json file and also tried to add FILE_TEXT when reading txt file. Still the same problem =(
