Take the piece of code below, which simply trims a string, removing whitespace characters from both ends:

const std::string TrimString(const std::string& s)
{
    const auto iter = std::find_if(s.cbegin(), s.cend(), [](auto c) -> bool { return !std::isspace(static_cast<int32_t>(c)); });
    return iter != s.end() ?
        std::string(iter, std::find_if(s.crbegin(), s.crend(), [](auto c) -> bool { return !std::isspace(static_cast<int32_t>(c)); }).base()) :
        std::string();
}


// Usage
std::vector<uint8_t> d{ 0xc5, 0xbc };     // example UTF-8 character
std::string utf8(d.begin(), d.end());

std::string trimmed = TrimString(utf8);

If you run the above code on MSVC (17.14.19) it will actually crash, but on Linux using GCC (14.2.0) it works perfectly.

Now, I know WHY it crashes and it's easy enough to fix, but what I'm looking for is an understanding of this difference and of what the standard says about it.

The reason for the crash is that on MSVC, std::isspace takes an int that must be in the range -1 to 255 (according to the runtime crash dialog). But then, why does this work on GCC?

Obviously, this has to do with the auto as the parameter of the lambda. In MSVC, the auto parameter of the lambda is probably a char so each byte is being sign-extended and that's what causes the crash (as it ends up as a negative value). What I'm not sure about is what is happening in the case of GCC. Surely, this would also be doing something similar? Is std::isspace less picky on Linux?

As I said, it's an easy fix, but I am looking for a better understanding of the difference between MSVC and GCC in this regard.

  • From std::isspace documentation: "The behavior is undefined if the value of ch is not representable as unsigned char and is not equal to EOF". Your program takes 0xC5, converts it to char (when constructing the std::string), which likely turns it into a negative value. Then that negative value is converted to int - still negative. Then you pass that negative value to std::isspace, whereupon your program exhibits undefined behavior. "Seems to work" is one possible manifestation of undefined behavior; "crash" is another. Commented Nov 12 at 19:02
  • Undefined behavior is undefined. "Seems to work" is one possible manifestation of undefined behavior. Practically speaking, it appears that MSVC implementation chose to add an assert for the proper range (perhaps only in debug build); GCC implementation chose not to. Commented Nov 12 at 19:06
  • @the welder "is the auto actually a char or an int?" - It is whatever you initialized it with. Commented Nov 12 at 19:10
  • The auto should be char in this example, with both implementations. You can confirm this by printing sizeof(c), or typeid(c).name(), or adding static_assert(std::is_same_v<decltype(c), char>) Commented Nov 12 at 19:11
  • @TheWelder "If you change the auto to a uint8_t it will work perfectly well" - Yeah, if you change auto to some type that is different to what will be deduced from its initializer it will have different behaviour - of course. But that's not the fault of auto - auto just does what it is specified to do; take on the type of what it is initialised with. Commented Nov 12 at 19:12

2 Answers

The auto is deduced as char for both compilers. The difference is mainly due to Windows' C runtime library being more strict.
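The deduction can be checked at compile time. The following is only a sketch: AutoDeducesChar is a hypothetical helper name that does not appear in the question, and it assumes a compiler with generic lambda support (C++14 or later).

```cpp
#include <string>
#include <type_traits>

// Hypothetical probe (not from the original post): calls a generic lambda
// with one element of a std::string and reports what `auto` was deduced as.
inline bool AutoDeducesChar()
{
    const std::string s = "x";
    auto probe = [](auto c) -> bool {
        // Checked when the lambda is instantiated: compilation fails if the
        // parameter is deduced as anything other than plain char.
        static_assert(std::is_same<decltype(c), char>::value,
                      "auto should deduce char for std::string elements");
        return std::is_same<decltype(c), char>::value;
    };
    return probe(s[0]);
}
```

If this compiles on both MSVC and GCC, the lambda parameter has the same type on both, so the difference in behavior has to come from somewhere else.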

The standard says ([basic.fundamental]/7):

Type char is a distinct type that has an implementation-defined choice of “signed char” or “unsigned char” as its underlying type.

On MSVC, char is signed by default (unless the /J option is on). On GCC, the default signedness of char depends on the target: signed on x86, unsigned on Linux arm64, etc.

The C++ standard specifies that std::isspace has the same meaning as the C standard isspace. The C standard specifies that isspace takes an int, and that the argument value must either be representable as an unsigned char or equal EOF. When char is signed, 0xc5 cast to char then to int is a negative value and thus not representable as an unsigned char.
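To make the conversion concrete, here is a small sketch of the two ways a byte can be widened to int. The names kCharIsSigned, WidenDirectly and WidenViaUnsignedChar are made up for this illustration, and the numbers in the note below assume an 8-bit char.

```cpp
#include <limits>

// True when plain char is a signed type on this target (implementation-defined).
constexpr bool kCharIsSigned = std::numeric_limits<char>::is_signed;

// Widens a char to int directly, as static_cast<int32_t>(c) does in the question.
// With a signed char, bytes above 0x7f become negative ints.
constexpr int WidenDirectly(char c)
{
    return static_cast<int>(c);
}

// Widens through unsigned char first, so the result is never negative
// and is always a valid argument for the <cctype> functions.
constexpr int WidenViaUnsignedChar(char c)
{
    return static_cast<int>(static_cast<unsigned char>(c));
}
```

On mainstream implementations where char is signed, WidenDirectly(static_cast<char>(0xc5)) gives -59, while WidenViaUnsignedChar gives 197 (0xc5) regardless of the signedness of char.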

On Windows, the debug C runtime library checks whether the argument is within the required range, and raises an assertion if it isn't (ref).

The behavior of isspace and _isspace_l is undefined if c isn't EOF or in the range 0 through 0xFF, inclusive. When a debug CRT library is used and c isn't one of these values, the functions raise an assertion.

On GNU/Linux, however, the C library allows negative values as an extension (ref).

As an extension, the GNU C Library accepts signed char values as ‘is’ functions arguments in the range -128 to -2, and returns the result for the corresponding unsigned character. However, as there might be an actual character corresponding to the EOF integer constant, doing so may introduce bugs, and it is recommended to apply the conversion to the unsigned character range as appropriate.

2 Comments

gcc doesn't default char to unsigned: godbolt.org/z/ze9WdMrvf
the only difference between MSVC and gcc is the implementation of isspace: godbolt.org/z/Ynv7b44WM
On both compilers auto is deduced as char, and char is a signed type (though on other compilers, or with different compiler flags, it might be unsigned).

std::isspace has undefined behaviour if:

ch is not representable as unsigned char and is not equal to EOF

You therefore need to cast the values to unsigned char or, more simply, change auto to unsigned char to get the correct result.

The difference between the two compilers is that, with debug flags enabled, MSVC adds additional checks for undefined behaviour, including validation of the values passed to std::isspace. If you build without the debug flags, no error is shown: https://godbolt.org/z/Wj98z337z. If you use unsigned char instead of auto, the code works in all cases:

const std::string TrimString(const std::string& s)
{
    const auto iter = std::find_if(s.cbegin(), s.cend(), [](unsigned char c) -> bool { return !std::isspace(static_cast<int32_t>(c)); });
    return iter != s.end() ?
        std::string(iter, std::find_if(s.crbegin(), s.crend(), [](unsigned char c) -> bool { return !std::isspace(static_cast<int32_t>(c)); }).base()) :
        std::string();
}

https://godbolt.org/z/54br8bb3G
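The other option mentioned above, keeping auto and casting inside the lambda, could look roughly like this. It is a sketch rather than a drop-in replacement: TrimStringCast is a hypothetical name used here to avoid clashing with the original function, and the result assumes the default "C" locale.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical variant of TrimString: the lambda parameter stays `auto`
// (deduced as char), and the cast to unsigned char happens at the call site.
inline std::string TrimStringCast(const std::string& s)
{
    // Converting to unsigned char keeps the argument of std::isspace
    // within the range the standard requires, whatever the signedness of char.
    const auto notSpace = [](auto c) -> bool {
        return !std::isspace(static_cast<unsigned char>(c));
    };

    const auto first = std::find_if(s.cbegin(), s.cend(), notSpace);
    if (first == s.cend())
        return std::string();   // empty or all whitespace

    // base() turns the reverse iterator into one past the last kept character.
    const auto last = std::find_if(s.crbegin(), s.crend(), notSpace).base();
    return std::string(first, last);
}
```

Sharing one lambda between both searches also removes the duplicated predicate from the original code.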
