From the MongoDB manual:

By default, all database strings are UTF8. To save images, binaries, and other non-UTF8 data, you can pass the string as a reference to the database.

I'm fetching pages and want to store the content for later processing.

  • I cannot rely on the meta charset, because many pages have UTF-8 content but wrongly declare iso-8859-1 or similar
  • so I can't use Encode (I don't know the originating charset)
  • therefore, I want to store the content simply as a stream of bytes (binary data) for later processing

Fragment of my code:

sub save {
    my ($self, $ok, $url, $fetchtime, $request ) = @_;

    my $rawhead = $request->headers_as_string;
    my $rawbody = $request->content;

    $self->db->content->insert(
        { "url" => $url, "rhead" => \$rawhead, "rbody" => \$rawbody } ) #using references here
      if $ok;

    $self->db->links->update(
        { "url" => $url },
        {
            '$set' => {
                'status'       => $request->code,
                'valid'        => $ok,
                'last_checked' => time(),
                'fetchtime'    => $fetchtime,
            }
        }
    );
}

But I get this error:

Wide character in subroutine entry at /opt/local/lib/perl5/site_perl/5.14.2/darwin-multi-2level/MongoDB/Collection.pm line 296.

This is the only place where I store data.

The question: is encoding the data (e.g. with base64) the only way to store binary data in MongoDB?

  • Will it give the same warning if you set $rawhead and $rawbody to the sample given in the manual (i.e., "\xFF\xFE\xFF")? Commented Jun 20, 2012 at 9:20

2 Answers

3

It looks like another sad story about the utf8 flag...

I may be wrong, but it seems that the headers_as_string and content methods of HTTP::Message return their strings as sequences of characters. The MongoDB driver, however, expects strings explicitly passed to it as 'binaries' to be sequences of octets, hence the warning drama.

A rather ugly fix is to turn off the utf8 flag on $rawhead and $rawbody in your code (I wonder whether the MongoDB driver shouldn't really do this itself), with something like this:

use Encode qw(_utf8_off); # not exported by default
_utf8_off $rawhead;
_utf8_off $rawbody; # ugh

The alternative is to use encode('utf8', $rawhead), but then you have to use decode when extracting the values from the DB, and I doubt that's any less ugly.
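
As a rough sketch of that encode/decode round trip, using only the core Encode module: the sample string is made up, and the MongoDB insert and fetch steps are left out, so this only illustrates the conversion on either side of the database.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Made-up stand-in for a body that arrived as Perl characters
# (it contains a code point above 0xFF).
my $rawbody = "caf\x{e9} \x{263a}";

# Characters -> octets before storing; every element now fits in one byte.
my $octets = encode('UTF-8', $rawbody);

# ... store \$octets in the database, fetch it again later ...

# Octets -> characters after reading the value back.
my $restored = decode('UTF-8', $octets);

printf "stored %d octets, restored %d characters\n",
    length $octets, length $restored;
```

The cost is the one mentioned above: every reader of the collection has to know that the stored value is UTF-8 encoded and decode it again.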

0

Your data is characters, not octets. Your assumption seems to be that you are just passing things through as octets, but you must have violated that assumption earlier somehow by decoding incoming text data, perhaps even without you noticing.

So simply do not decode: the data stays octets, and storing it in the database won't fail.
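
One way to check that assumption before inserting is to test whether a string still fits in single bytes. This is only a sketch: the helper name is_octets is hypothetical and not part of any driver, the first sample is the "\xFF\xFE\xFF" string from the manual mentioned in the comment under the question, and the second is a made-up decoded string.

```perl
use strict;
use warnings;

# Hypothetical helper: true when every element of the string is <= 0xFF,
# i.e. the string can safely be treated as raw octets.
sub is_octets {
    my ($str) = @_;
    return $str !~ /[^\x00-\xFF]/;
}

my $bytes = "\xFF\xFE\xFF";      # sample binary string from the manual
my $chars = "snowman \x{2603}";  # decoded text with a wide character

for my $sample ($bytes, $chars) {
    print is_octets($sample)
        ? "octets: safe to pass as binary\n"
        : "characters: something decoded this earlier\n";
}
```

In HTTP::Message, content returns the raw body while decoded_content returns decoded characters, so if the second branch fires it may be worth checking which of the two was called upstream.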
