2

I have a Ruby on Rails application that is a server for Java and .Net apps. I have a custom header I'm using to send some data, but when this data reaches the Ruby on Rails app, Rails reads the value as UTF-8 then says the value is not a valid UTF-8 string.

For instance, if I send JÜRGENELITE-HP I get:

#<ActiveRecord::StatementInvalid: PGError: ERROR:  invalid byte sequence for encoding "UTF8": 0xdc52
: SELECT * FROM "replicas" WHERE ("replicas"."identification" = 'J?RGENELITE-HP') AND ("replicas".user_id = 121)  LIMIT 1>

The Java HTTP Client library clearly prints the data correctly in the console:

DEBUG [main] (DefaultClientConnection.java:268) - >> POST /ze/api/files.json HTTP/1.1
DEBUG [main] (DefaultClientConnection.java:271) - >> X-Replica: JÜRGENELITE-HP
DEBUG [main] (DefaultClientConnection.java:271) - >> Authorization: Basic bWxpbmhhcmVzOjEyMzQ1Njc4

DEBUG [main] (DefaultClientConnection.java:271) - >> Content-Length: 0
DEBUG [main] (DefaultClientConnection.java:271) - >> Host: localhost:3000
DEBUG [main] (DefaultClientConnection.java:271) - >> Connection: Keep-Alive
DEBUG [main] (DefaultClientConnection.java:271) - >> User-Agent: Apache-HttpClient/4.1.2 (java 1.5)

But when it reaches Rails it breaks. What encoding does HTTP uses to encode header values?

2

1 Answer 1

2

US-ASCII

If you look at section 2.2 of RFC2616:

2.2 Basic Rules

The following rules are used throughout this specification to
describe basic parsing constructs. The US-ASCII coded character set
is defined by ANSI X3.4-1986 [21].

   OCTET          = <any 8-bit sequence of data>
   CHAR           = <any US-ASCII character (octets 0 - 127)>
   UPALPHA        = <any US-ASCII uppercase letter "A".."Z">
   LOALPHA        = <any US-ASCII lowercase letter "a".."z">
   ALPHA          = UPALPHA | LOALPHA
   DIGIT          = <any US-ASCII digit "0".."9">
   CTL            = <any US-ASCII control character
                    (octets 0 - 31) and DEL (127)>
   CR             = <US-ASCII CR, carriage return (13)>
   LF             = <US-ASCII LF, linefeed (10)>
   SP             = <US-ASCII SP, space (32)>
   HT             = <US-ASCII HT, horizontal-tab (9)>
   <">            = <US-ASCII double-quote mark (34)>

The remainder of the section has more specific information about headers and other elements of the protocol.

You have to jump around the spec quite a bit to find all of the right BNF definitions. Section 4.2 contains the definition for headers, though:

   message-header = field-name ":" [ field-value ]
   field-name     = token
   field-value    = *( field-content | LWS )
   field-content  = <the OCTETs making up the field-value
                    and consisting of either *TEXT or combinations
                    of token, separators, and quoted-string>

TEXT is defined back in Section 2.2:

   TEXT           = <any OCTET except CTLs,
                    but including LWS>
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.