
In C (and similar languages), a string literal is written as, for example, "abc". Another example is "ab\"c". I have a file that contains such strings. That is, the file contents are "abc" or "ab\"c", etc. Any string literal that can be written in a .c file can appear in the file I'm reading.

These strings can be malformed, e.g. "abc (no closing quote). What is the best way to write a parser that checks whether the string in the file is a valid C string literal (so that if I copy the file contents and paste them after char* str =, the resulting expression will be accepted by the compiler when at the top of a function)?

Each string is on a separate line.

Alternatively, you can think of this as wanting to parse lines that declare string literal variables. Imagine I'm grepping a big file with char\* .* = (.*);$ and want to make sure the part in the parentheses will not cause compilation errors.

  • Is each string on a new line? Commented Aug 9, 2020 at 16:51
  • @Carcigenicate the string ends at the end of the line. Commented Aug 9, 2020 at 16:56
  • @SourabhChoure yes Commented Aug 9, 2020 at 16:56
  • @SourabhChoure I want to lose them Commented Aug 9, 2020 at 17:13
  • See flex; it should be sufficient to tokenize the input and check that it contains only string tokens. Commented Aug 9, 2020 at 17:42

1 Answer


The grammar for C string literals is given in C 2018 6.4.5. Supposing you want to parse only plain strings, not those with encoding prefixes such as u in u"xyz", then the grammar for a string-literal is " s-char-sequence_opt ", where “opt” means optional and s-char-sequence is one or more s-char items. An s-char is either any member of the source character set except ", \, or the new-line character, or an escape-sequence.

The source character set includes at least the Latin alphabet (26 letters A-Z) in uppercase and lowercase, the ten digits, space, horizontal tab, vertical tab, form feed, and these 29 characters:

!"#%&'()*+,-./:;<=>?[\]^_{|}~

However, a C implementation may include other characters in its source character set. Therefore, any character found in the string other than ", \, or the new-line character must be accepted as potentially valid in some C implementation.

An escape-sequence is defined in 6.4.4.4 paragraph 1 to be one of:

  • \ followed by ', ", ?, \, a, b, f, n, r, t, or v,
  • \ followed by one to three octal digits, or
  • \x followed by one or more hexadecimal digits, or
  • a universal-character-name.

Paragraph 7 says:

Each octal or hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence.
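
To make the longest-sequence rule concrete, here is a small sketch in C. The array names are made up for illustration. An octal escape stops after at most three digits, while a hexadecimal escape keeps absorbing hexadecimal digits, so a following letter such as B has to be separated, for example by splitting the literal:

```c
#include <assert.h>

/* "\1234" is the octal escape \123 (at most three digits) followed by the
   ordinary character '4', so the array holds two characters plus the
   terminating null character. */
static const char octal_then_digit[] = "\1234";

/* A hexadecimal escape has no length limit: written as "\x41B", the escape
   would absorb the B as well. Splitting the literal ends the escape after
   \x41, because escapes are converted before adjacent literals are joined. */
static const char hex_then_letter[] = "\x41" "B";

/* Compile-time checks of the sizes claimed above (C11 static_assert). */
static_assert(sizeof octal_then_digit == 3, "\\123, '4', and the terminator");
static_assert(sizeof hex_then_letter == 3, "\\x41, 'B', and the terminator");
```

For a validator this means an octal escape consumes at most three digits, while a hexadecimal escape consumes every hexadecimal digit that follows.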

A universal-character-name is defined in 6.4.3 to be \u followed by four hexadecimal digits or \U followed by eight hexadecimal digits. Paragraph 2 limits these:

A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.
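
That constraint is easy to express as a predicate. The following is only a sketch; the function name ucn_value_allowed is hypothetical, and it assumes the four or eight hexadecimal digits have already been converted to a number:

```c
#include <stdbool.h>

/* Hypothetical helper: returns true if `code` may be written as a
   universal character name under the constraint quoted above. */
static bool ucn_value_allowed(unsigned long code)
{
    if (code >= 0xD800UL && code <= 0xDFFFUL)
        return false;   /* D800 through DFFF is excluded */
    if (code < 0x00A0UL) /* below 00A0 only $, @ and ` are allowed */
        return code == 0x0024UL || code == 0x0040UL || code == 0x0060UL;
    return true;
}
```

For example, ucn_value_allowed(0x00E9) is true, while ucn_value_allowed(0x0041) is false because 0041 is below 00A0 and is not one of the three exceptions.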

This part of the C grammar looks fairly simple to parse:

  • A string literal must start with a ".
  • If the next character is anything other than ", \, or a new-line character, then accept it.
  • If the next character is \ and it is followed by one of the single characters listed above, accept it and the following character.
  • If the next character is \ and it is followed by one to three octal digits, accept it and up to three octal digits.
  • If the next two characters are \x and are followed by a hexadecimal digit, accept them and all the hexadecimal digits that follow.
  • If the next two characters are \u and are followed by four hexadecimal digits, accept those six characters. However, if the value is one of those prohibited in the constraint above, this is not a valid C string literal.
  • If the next two characters are \U and are followed by eight hexadecimal digits, accept those ten characters. However, if the value is one of those prohibited in the constraint above, this is not a valid C string literal.
  • Repeat the above until the next character is not accepted.
  • If the next character is not ", this is not a valid C string literal.
  • If the next character is ", accept it.
  • If that is the end of the line read from the file, it is a valid C string literal. Otherwise, it is not.
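
The steps above could be sketched in C roughly as follows. This is one possible reading of the rules, not a definitive implementation. The names is_c_string_literal, read_hex and ucn_ok are hypothetical, and the sketch assumes each line is passed as a null-terminated string with its trailing new-line already removed. It also does not check that the value of an octal or hexadecimal escape fits in the character type, which 6.4.4.4 paragraph 9 requires separately.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: the 6.4.3 paragraph 2 constraint on UCN values. */
static bool ucn_ok(unsigned long code)
{
    if (code >= 0xD800UL && code <= 0xDFFFUL) return false;
    return code >= 0xA0UL || code == 0x24UL || code == 0x40UL || code == 0x60UL;
}

/* Hypothetical helper: reads exactly n hexadecimal digits starting at p.
   Stops at the first non-digit (including the terminator), so it never
   reads past the end of the string. */
static bool read_hex(const char *p, int n, unsigned long *out)
{
    unsigned long v = 0;
    for (int i = 0; i < n; i++) {
        unsigned char c = (unsigned char)p[i];
        if (!isxdigit(c)) return false;
        v = v * 16 + (unsigned long)(isdigit(c) ? c - '0' : tolower(c) - 'a' + 10);
    }
    *out = v;
    return true;
}

/* Returns true if `line` is exactly one unprefixed C string literal. */
bool is_c_string_literal(const char *line)
{
    const char *p = line;
    if (*p++ != '"') return false;              /* must start with " */

    while (*p != '\0' && *p != '"' && *p != '\n') {
        if (*p != '\\') { p++; continue; }      /* ordinary s-char */
        p++;                                    /* skip the backslash */
        if (*p != '\0' && strchr("'\"?\\abfnrtv", *p)) {
            p++;                                /* simple escape */
        } else if (*p >= '0' && *p <= '7') {    /* one to three octal digits */
            for (int i = 0; i < 3 && *p >= '0' && *p <= '7'; i++) p++;
        } else if (*p == 'x') {                 /* \x plus at least one hex digit */
            p++;
            if (!isxdigit((unsigned char)*p)) return false;
            while (isxdigit((unsigned char)*p)) p++;
        } else if (*p == 'u' || *p == 'U') {    /* universal character name */
            int n = (*p == 'u') ? 4 : 8;
            unsigned long code;
            if (!read_hex(p + 1, n, &code) || !ucn_ok(code)) return false;
            p += 1 + n;
        } else {
            return false;                       /* unknown escape or lone \ */
        }
    }
    /* The closing quote must be the last character on the line. */
    return p[0] == '"' && p[1] == '\0';
}
```

With this sketch, a line containing "ab\"c" is accepted, while "abc without a closing quote, "ab\c" with an unknown escape, and "abc" followed by trailing text are all rejected.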

2 Comments

There's also concatenation of adjacent string literals, and predefined string literals like __FILE__. Both satisfy OP's rather loose requirement "that if I copy [...] paste them after char* str =, the resulting expression will be accepted by the compiler". For example, the line "hello, " /* comment */ __FILE__ is valid by OP's definition.
@user3386109: Considering arbitrary function calls and other expressions could be placed there, as well as ; followed by arbitrary statements and declarations, I do not think that is what they really want. They did not cite that as a clear definition of what they want, just an example for illustration.
