2

I generate a lot of WordPress posts from an XML feed. The problem: accented characters.

The header of the feed is:

<?xml version="1.0" encoding="ISO-8859-15"?>

Here is the complete feed: http://flux.netaffiliation.com/rsscp.php?maff=177053821BA2E13E910D54

My site is in UTF-8.

So I use the function utf8_encode(), but that does not solve the problem: the accents are still displayed incorrectly.

Does anyone have an idea?

EDIT 04-10-2011 18:02 (French time):

Here is my code:

/**
 * Parse an RSS feed from NetAffiliation and convert each item to a post.
 * @param string $flux URL of the external feed
 * @return bool
 */
private function parseFluxNetAffiliation($flux)
{
    $content = file_get_contents($flux);
    $content = iconv("iso-8859-15", "utf-8", $content);

    $xml = new DOMDocument;
    $xml->loadXML($content);

    // get the first link: http://www.netaffiliation.com
    $link = $xml->getElementsByTagName('link')->item(0);
    //echo $link->textContent;

    // collect all items into a multidimensional array
    $items = $xml->getElementsByTagName('item');

    $offers = array();
    // walk the items
    foreach($items as $item)
    {
        $childs = $item->childNodes;

        // walk the child nodes
        foreach($childs as $child)
        {
            $offers[$child->nodeName][] = $child->nodeValue;
        }

    }
    unset($offers['#text']);

    // create one post for each offer
    $nbrPosts = count($offers['title']);

    if($nbrPosts <= 0) 
    {
        echo self::getFeedback("Le flux ne continent aucune offre",'error');
        return false;
    }

    $i = 0;
    while($i < $nbrPosts)
    {
        // Create post object
        $description = '<p>'.$offers['description'][$i].'</p><p><a href="'.$offers['link'][$i].'" target="_blank">'.$offers['link'][$i].'</a></p>';

        $my_post = array(
            'post_title' => $offers['title'][$i],
            'post_content' => $description,
            'post_status' => 'publish',
            'post_author' => 1,
            'post_category' => array(self::getCatAffiliation())
        );

        // Insert the post into the database
        wp_insert_post($my_post);

        $i++;
    }

    echo self::getFeedback("Le flux a généré {$nbrPosts} article(s) depuis le flux NetAffiliation dans la catégorie affiliation",'updated');
    return true;

}

All the posts are generated, but the accented characters are garbled. You can see the result here: http://monsieur-mode.com/test/

5
  • Why do you need to generate iso-8859-15? Commented Oct 4, 2011 at 15:34
  • The feed is not mine. It is in ISO-8859-15 and I want to get the content in UTF-8 so that it displays cleanly on my website. Commented Oct 4, 2011 at 15:36
  • Hmm. loadXML() will parse the XML header, so iconv() alone won't help here. You might have to force the correct encoding on loadXML(), but I don't know how... Commented Oct 4, 2011 at 16:07
  • Maybe I should replace (with preg_replace) the "encoding" part with "UTF-8", or remove it entirely. I'll try this. Commented Oct 5, 2011 at 7:52
  • It might work in conjunction with iconv(). Can you put an example XML online? Commented Oct 5, 2011 at 8:31
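
To illustrate the point raised in these comments, here is a minimal sketch, assuming a libxml build that knows ISO-8859-15. The one-item feed and the firstTitle() helper are made up for the example. It parses the same document twice: once untouched, and once after iconv() with the original declaration still in place.

```php
<?php
// Hypothetical one-item feed, stored as ISO-8859-15 bytes ("\xE9" is "é").
$latin9 = "<?xml version=\"1.0\" encoding=\"ISO-8859-15\"?>\n"
        . "<rss><item><title>Caf\xE9</title></item></rss>";

// Helper for illustration only: parse a string and return the first <title>.
function firstTitle(string $xml): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    return $doc->getElementsByTagName('title')->item(0)->nodeValue;
}

// Case 1: untouched bytes. The parser reads the declaration and decodes the
// input itself; DOM hands strings back as UTF-8.
$direct = firstTitle($latin9);

// Case 2: iconv() first, declaration left in place. The bytes are now UTF-8,
// but the parser still decodes them as ISO-8859-15, so each byte of "é" is
// treated as a separate character (double encoding).
$converted = iconv('ISO-8859-15', 'UTF-8', $latin9);
$double = firstTitle($converted);

// Print the raw bytes so the difference is visible on any terminal.
echo bin2hex($direct), "\n";
echo bin2hex($double), "\n";
```

If this holds for the real feed, either skip the manual conversion and let the parser honor the declaration, or convert and also fix the declaration so the two agree.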

4 Answers

2

There are plenty of difficulties to master when converting between encodings. Encodings that use more than one byte per character (so-called multibyte encodings) such as UTF-8, which WordPress uses, deserve special attention in PHP.

  • First, make sure that all the files you create are saved in the same encoding in which they will be served. For example, the encoding you pick in the "Save as..." dialog should match the one you declare in the HTTP Content-Type header.
  • Second, verify that the input has the same encoding as the output you want to deliver. In your case, the input has the encoding ISO-8859-15, so you'll need to convert it to UTF-8 using iconv().
  • Third, be aware that PHP's core string functions work on bytes and are not multibyte-aware. Functions such as htmlentities() can produce strange characters when the charset they assume does not match your data. For many of these functions there are multibyte alternatives, prefixed with mb_. If your encoding is UTF-8, check your files for such functions and replace them where necessary.
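
As a small illustration of the third point, here is a sketch that assumes the mbstring extension is loaded. The sample word is arbitrary, and its "é" is written as raw UTF-8 bytes so the result does not depend on how the script file itself is saved.

```php
<?php
// "café" with "é" spelled out as its two UTF-8 bytes.
$word = "caf\xC3\xA9";

// strlen() counts bytes; mb_strlen() counts characters in the given encoding.
$bytes = strlen($word);
$chars = mb_strlen($word, 'UTF-8');

// substr() can cut a multibyte character in half; mb_substr() works on
// whole characters.
$cut   = substr($word, 0, 4);
$clean = mb_substr($word, 0, 4, 'UTF-8');

printf("%d bytes, %d characters\n", $bytes, $chars);
var_dump(mb_check_encoding($cut, 'UTF-8'));   // is the byte-wise cut valid UTF-8?
var_dump(mb_check_encoding($clean, 'UTF-8')); // is the character-wise cut valid?
```

The byte-oriented functions are not wrong, they simply answer a different question, which is why mixing them with UTF-8 text tends to surface as broken characters.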

For more information about these topics, see the Wikipedia article on variable-width encodings and the corresponding page in the PHP manual.

2 Comments

Hello, all my files are encoded in UTF-8 (the default option in Aptana). My meta charset is UTF-8 and I don't use htmlentities(). Thank you anyway for your help.
This is very good advice, but not for this specific situation.

0

By default, most applications work with UTF-8 data and output UTF-8 content. WordPress is surely no exception and works on a UTF-8 basis.

I would not convert anything at all when printing; instead, change your header to declare UTF-8 rather than ISO-8859-15.

Comments

0

If your incoming XML data is ISO-8859-15, use iconv() to convert it:

$stream = file_get_contents("stream.xml");
$stream = iconv("iso-8859-15", "utf-8", $stream);
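
A slightly more defensive variant, as a sketch only: iconv() returns false when a conversion fails (and so does file_get_contents() when the read fails), so it can help to check before using the result. The latin9ToUtf8() helper and the sample string are hypothetical. Note that this converts the bytes only; any encoding declaration inside the XML is left as it was.

```php
<?php
// Hypothetical helper: convert ISO-8859-15 bytes to UTF-8 or fail loudly.
function latin9ToUtf8(string $bytes): string
{
    $utf8 = iconv('ISO-8859-15', 'UTF-8', $bytes);
    if ($utf8 === false) {
        throw new RuntimeException('iconv() could not convert the input');
    }
    return $utf8;
}

// 0xA4 is the euro sign in ISO-8859-15 (it is the generic currency sign in
// ISO-8859-1); 0xE9 is "é" in both.
$sample = "10 \xA4 pay\xE9s";
$utf8   = latin9ToUtf8($sample);

// Print the raw bytes so the result can be inspected on any terminal.
echo bin2hex($utf8), "\n";
```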

3 Comments

Accents are transformed into other characters with your code :s, thank you anyway.
@Raphael can you show the code you are using and examples of how the characters are breaking? The situation is a bit unclear.
I edited my question with the feed link and my code. I have to go home now, so if you reply, I will only see it tomorrow. Thanks for your help =)

0

mb_convert_encoding() saved my life.

Here is my solution:

    $content = preg_replace('/ encoding="ISO-8859-15"/is','',$content);
    $content = mb_convert_encoding($content,"UTF-8");
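
One caveat about the two lines above: when mb_convert_encoding() is called without a third argument, the source encoding is taken from the mbstring internal encoding (or default_charset), so the result depends on configuration. The following is a sketch of the same idea with the source named explicitly, not necessarily what ran on the original site. The sample feed is made up, and the mbstring and DOM extensions are assumed to be available.

```php
<?php
// Hypothetical feed in ISO-8859-15: "\xE9" is "é", "\xA4" is the euro sign.
$content = "<?xml version=\"1.0\" encoding=\"ISO-8859-15\"?>\n"
         . "<rss><item><title>Soldes d'\xE9t\xE9 : 10 \xA4</title></item></rss>";

// Step 1: drop the encoding declaration (same pattern as in the answer),
// so the parser falls back to the XML default, UTF-8.
$content = preg_replace('/ encoding="ISO-8859-15"/is', '', $content);

// Step 2: convert the bytes, naming the source encoding instead of relying
// on the mbstring internal encoding.
$content = mb_convert_encoding($content, 'UTF-8', 'ISO-8859-15');

// The declaration and the bytes now agree, so the parser does not re-decode.
$xml = new DOMDocument();
$xml->loadXML($content);
$title = $xml->getElementsByTagName('title')->item(0)->nodeValue;

echo $title, "\n";
```

The important part is that the declaration and the actual bytes stay in agreement; which of the two you change is a matter of taste.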

Comments
