I am parsing a text file and occasionally run into data such as:

CASTA¥EDA, JASON  

I am using a MongoDB backend, and when I try to save this data I get errors like:

[MongoDB\Driver\Exception\UnexpectedValueException]
  Got invalid UTF-8 value serializing 'Jason Casta�eda'

After some searching, I found two functions whose author says they should handle this:

    public function is_utf8( $str )
    {
        return preg_match( "/^(
         [\x09\x0A\x0D\x20-\x7E]            # ASCII
       | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
       |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
       | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
       |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
       |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
       | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
       |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
      )*$/x",
            $str
        );
    }

    public function force_utf8($str, $inputEnc='WINDOWS-1252')
    {
        if ( $this->is_utf8( $str ) ) // Nothing to do.
            return $str;

        if ( strtoupper( $inputEnc ) === 'ISO-8859-1' )
            return utf8_encode( $str );

        if ( function_exists( 'mb_convert_encoding' ) )
            return mb_convert_encoding( $str, 'UTF-8', $inputEnc );

        if ( function_exists( 'iconv' ) )
            return iconv( $inputEnc, 'UTF-8', $str );

        // You could also just return the original string.
        trigger_error(
            'Cannot convert string to UTF-8 in file '
            . __FILE__ . ', line ' . __LINE__ . '!',
            E_USER_ERROR
        );
    }

Using the two functions above, I check whether a line of text is valid UTF-8 by calling is_utf8($text), and if it is not, I call force_utf8($text). However, I still get the same error. Any pointers?

  • Take a look at stackoverflow.com/questions/5920626/… Commented Oct 13, 2016 at 22:11
  • Thanks @GerardRoche I did look at it and tried using iconv but no change in behavior. Commented Oct 13, 2016 at 22:15
  • Try 'Latin1' as $inputEnc Commented Oct 13, 2016 at 22:32

1 Answer

This question is pretty old, but for anyone who faces the same issue and lands on this page like I did:

mb_convert_encoding($value, 'UTF-8', 'UTF-8');

This should replace every invalid UTF-8 byte sequence with a ? character, so the result is safe for MongoDB insert/update operations.
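
As a minimal sketch of that approach (the helper name scrub_utf8 and the sample string are hypothetical, not taken from the question), the call can be wrapped so that valid strings pass through untouched and only invalid ones are rewritten. It assumes the mbstring extension is loaded and the substitute character is left at its default:

```php
<?php
// Hypothetical helper: returns a string that is always valid UTF-8.
function scrub_utf8(string $value): string
{
    // mb_check_encoding() reports whether the bytes are well-formed UTF-8.
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // Already valid, nothing to do.
    }

    // Converting UTF-8 to UTF-8 substitutes each invalid sequence
    // (with "?" unless mb_substitute_character() was changed).
    return mb_convert_encoding($value, 'UTF-8', 'UTF-8');
}

// Sample input: a lone 0xF1 byte followed by "eda" is not valid UTF-8.
$dirty = "Jason Casta\xF1eda";
$clean = scrub_utf8($dirty);

var_dump(mb_check_encoding($dirty, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($clean, 'UTF-8')); // bool(true)
```

Note that this discards the original character rather than recovering it. If the file's real encoding is known, converting from that encoding keeps the character. For instance, if the ¥ in the question's sample is byte 0xA5, that byte is Ñ in code page 850, so passing 'CP850' as the source encoding to mb_convert_encoding may be worth trying. That encoding is only a guess based on the sample.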
