Revisions to Dynamic type implementation in C

C++ does allow using unsigned char as backing storage for placement new.

Source Link

edited Oct 10, 2022 at 17:51

3.8k
18
28

I'm pretty sure using other types in a char buf[20] is at least technically a strict-aliasing violation, so maybe you should be using memcpy to get data in/out. This isn't anonymous memory you got from malloc; it has a static type, which is char. char* can alias anything, but int* isn't allowed to alias char objects or arrays.

C++ explicitly allows using an array-of-unsigned char object as storage for other objects, via placement-new. But C doesn't have placement-new, so if this is allowed, it would rely on a store beginning the lifetime of an object of one type, and a store with a different type ending the lifetime of the original object. Probably best to use unsigned char instead of char as your building block, even in C.

I'm pretty sure using other types in a char buf[20] is a strict-aliasing violation, so maybe you should be using memcpy to get data in/out. This isn't anonymous memory you got from malloc; it has a static type, which is char. char* can alias anything, but int* isn't allowed to alias char objects or arrays.

I'm pretty sure using other types in a char buf[20] is at least technically a strict-aliasing violation, so maybe you should be using memcpy to get data in/out. This isn't anonymous memory you got from malloc; it has a static type, which is char. char* can alias anything, but int* isn't allowed to alias char objects or arrays.

C++ explicitly allows using an array-of-unsigned char object as storage for other objects, via placement-new. But C doesn't have placement-new, so if this is allowed, it would rely on a store beginning the lifetime of an object of one type, and a store with a different type ending the lifetime of the original object. Probably best to use unsigned char instead of char as your building block, even in C.

added 2342 characters in body

Source Link

edited Oct 7, 2022 at 17:10

Peter Cordes

3.8k
18
28

I'm pretty sure using other types in a char buf[20] is a strict-aliasing violation, too (when you reuse storage for a 2nd time), so maybe you should be using memcpy to get data in/out. This isn't anonymous memory you got from malloc; it has a static type, which is char. char* can alias anything, but int* isn't allowed to alias char objects or arrays.

See Is strict aliasing one-way? on Stack Overflow, and examples of breakage like GCC AVX _m256i cast to int array leads to wrong values.. (GCC defines __m256i with __attribute__((may_alias)), so for aliasing purposes __m256i* works like char*.) But this code isn't reading the char elements of the buffer, only accessing as the type stored, so actual breakage is much less likely.

This might be fine in practice since you're only ever reading using the type of the last write. @Supercat has plenty to say about compilers being overly willing to break code that should clearly work, and problems when reusing the same storage as a different type (e.g. after the 2nd write with a different type). This might be one of those cases that breaks; if you were really going to use this in production, that would be something to check on.

I'm not 100% sure this is a real problem. But if it is, it might not be avoidable. One way might to to declare the object initially with an array member to make it the full size, and then have the library functions use a version the same struct-of-union with a flexible array member at the end. And rely on the common-initial-sequence rule in C to avoid strict-aliasing problems. I hope.

  struct dtype {
     dtype_size sz;
     unsigned char type;   // or dtype_size
     union { 
       signed char sc; unsigned char uc;
       signed short ss; unsigned short us;
       int i; unsigned u;
       // ...
       char str[ max(sizeof(long long), sizeof(long double)) ];
       // or use a user-supplied size, with this declaration as a CPP macro
       // TODO: is it safe to write past the end of this array?
       // That would be much better than a separate space[] member.
     };
     //char space[];   // flexible array member, no size specified.
     // Probably not safe to access outside str[] and into this, especially if it has a specified size.
     // Hopefully str[] inside the union can work as a flexible array member.
  };

Maybe make that definition a macro that takes a size, and declare dtype with DECLARE_DTYPE(/**/); so the array member is empty. And a specific variable gets declared with DECLARE_DTYPE(10) var; to have char str[10]; in the union. So after expansion, inside a function you'd have struct local_dtype { ... ; union{ ...; char str[32]; }; };. And you'd be passing a pointer to that as an arg to a function declared as taking void* or a pointer to struct dtype { ...; union {...; char str[16]; };};. Probably have to be void* because implicit conversion from &var to a different struct tag won't work, even if they have a compatible initial sequence.

I'm not fully sure this would work with ISO C rules, this is just an idea / suggestion for something to look into. The common initial sequence rule is something you'd have to look into carefully, and flexible array members. It would also I think rule out runtime-variable sizes, which you could do with a VLA using your char array method.

I'm pretty sure using other types in a char buf[20] is a strict-aliasing violation, too (when you reuse storage for a 2nd time), so maybe you should be using memcpy to get data in/out. This isn't anonymous memory you got from malloc; it has a static type, which is char. char* can alias anything, but int* isn't allowed to alias char objects or arrays.

See Is strict aliasing one-way? on Stack Overflow, and examples of breakage like GCC AVX _m256i cast to int array leads to wrong values.. (GCC defines __m256i with __attribute__((may_alias)), so for aliasing purposes __m256i* works like char*.)

This might be fine in practice since you're only ever reading using the type of the last write. @Supercat has plenty to say about compilers being overly willing to break code that should clearly work, and problems when reusing the same storage as a different type. This might be one of those cases that breaks; if you were really going to use this in production, that would be something to check on.

I'm pretty sure using other types in a char buf[20] is a strict-aliasing violation, so maybe you should be using memcpy to get data in/out. This isn't anonymous memory you got from malloc; it has a static type, which is char. char* can alias anything, but int* isn't allowed to alias char objects or arrays.

See Is strict aliasing one-way? on Stack Overflow, and examples of breakage like GCC AVX _m256i cast to int array leads to wrong values.. (GCC defines __m256i with __attribute__((may_alias)), so for aliasing purposes __m256i* works like char*.) But this code isn't reading the char elements of the buffer, only accessing as the type stored, so actual breakage is much less likely.

This might be fine in practice since you're only ever reading using the type of the last write. @Supercat has plenty to say about compilers being overly willing to break code that should clearly work, and problems when reusing the same storage as a different type (e.g. after the 2nd write with a different type). This might be one of those cases that breaks; if you were really going to use this in production, that would be something to check on.

I'm not 100% sure this is a real problem. But if it is, it might not be avoidable. One way might to to declare the object initially with an array member to make it the full size, and then have the library functions use a version the same struct-of-union with a flexible array member at the end. And rely on the common-initial-sequence rule in C to avoid strict-aliasing problems. I hope.

  struct dtype {
     dtype_size sz;
     unsigned char type;   // or dtype_size
     union { 
       signed char sc; unsigned char uc;
       signed short ss; unsigned short us;
       int i; unsigned u;
       // ...
       char str[ max(sizeof(long long), sizeof(long double)) ];
       // or use a user-supplied size, with this declaration as a CPP macro
       // TODO: is it safe to write past the end of this array?
       // That would be much better than a separate space[] member.
     };
     //char space[];   // flexible array member, no size specified.
     // Probably not safe to access outside str[] and into this, especially if it has a specified size.
     // Hopefully str[] inside the union can work as a flexible array member.
  };

Maybe make that definition a macro that takes a size, and declare dtype with DECLARE_DTYPE(/**/); so the array member is empty. And a specific variable gets declared with DECLARE_DTYPE(10) var; to have char str[10]; in the union. So after expansion, inside a function you'd have struct local_dtype { ... ; union{ ...; char str[32]; }; };. And you'd be passing a pointer to that as an arg to a function declared as taking void* or a pointer to struct dtype { ...; union {...; char str[16]; };};. Probably have to be void* because implicit conversion from &var to a different struct tag won't work, even if they have a compatible initial sequence.

I'm not fully sure this would work with ISO C rules, this is just an idea / suggestion for something to look into. The common initial sequence rule is something you'd have to look into carefully, and flexible array members. It would also I think rule out runtime-variable sizes, which you could do with a VLA using your char array method.

added 1317 characters in body

Source Link

edited Oct 7, 2022 at 13:53

Peter Cordes

3.8k
18
28

I didn't look at the details of everything else, mostly just wanted to talk in more detail about those two issues, but did spot a few other things.

typedef unsigned short DTYPE_SIZE; seems odd to me; normally type names aren't in all-caps, and it seems very ugly. Unfortunately all names ending with _t are reserved by POSIX, but you could use dtype_size or dtype_size_type. See on SO: C type naming conventions, _t or ALLCAPS - all the answers there recommend against ALL_CAPS type names. C stdio's FILE* is an ancient exception, not a model to follow. (C stdio predates the C language even being fully C so has other clunky things like strings for fopen modes, not constant flags because the C preprocessor didn't exist yet.)

pacmaninbw pointed out that _dtset_obj(var, value, strlen(value) + 1) in dtset_str truncates the size_t return value to dtype_size, which is currently unsigned short.

You might might want to define this function specially, and strncpy with the length as a limit. But that would force zeroing of the full dtype buffer out to its max size, because strncpy has a silly definition (or is designed for a use-case other than length-capped strcpy). And strncpy still wouldn't guarantee zero-termination, so maybe not a good suggestion.

We know that the max size of a dtype object is USHORT_MAX, so very large strings can't go into one in the first place.

But _dtset_obj doesn't need to take a narrow arg. It can and should take size_t or at least unsigned, so it can check to see if you're trying to copy from a way-too-big string.

Instead, the actual behaviour is that it truncates the string length to (usually) 16-bit before checking. So worst case (strlen+1) % 65536 is some small number that fits, and we take first few bytes of the string. Even if that's less than the the size of the dtype object. This is weird and should be avoided, so take a size_t arg.

I didn't look at the details of everything else, mostly just wanted to talk in more detail about those two issues.