I've been experimenting with memcpy() implementations and optimizations in C; I recently wrote a version which was supposed to use successive powers-of-two to speed up the copy:
- first, large 8-byte blocks are copied as
uint64_t - the same is successively performed for
uint32_tanduint16_t - finally, any remainder is copied bytewise
The example implementation and test main I wrote functions successfully, although at least one test case (string length = 72) appears to add two additional spurious bytes to the end of the string.
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
static inline size_t detalign(size_t size)
{
if (size == 0) return 0;
if (!(size & 0x7)) return 8;
if (!(size & 0x3)) return 4;
if (!(size & 0x1)) return 2;
else return 1;
}
size_t strlen(const char *str)
{
size_t k = 0;
while (str[k] != '\0')
k++;
return k;
}
void *memcpy(void *restrict dest, const void *restrict src, size_t count) {
if (count < 2048) {
int sizes[4] = { 8, 4, 2, 1 };
int pos = 0, rem = count;
while (rem > 0) {
// find the greatest power-of-two divisor which fits the remaining source count
int divisor = 0;
for (int i = 0; i < 4; i++) {
if (rem / sizes[i] && sizes[i] > divisor)
divisor = sizes[i];
}
// switch over the sizes, because each size must be handled with its own type.
int iter = 0;
int reps = rem / divisor;
switch (divisor) {
case 8:
uint64_t *restrict qdest = (uint64_t *) (dest + pos), *restrict qsrc = (uint64_t *) (src + pos);
for (int j = 0; j < reps; j++)
qdest[j] = qsrc[j];
printf("Iteration %d: start position %d, end position %d, size %d.\n", iter, pos, pos + (reps * divisor), divisor);
break;
case 4:
uint32_t *restrict ddest = (uint32_t *) (dest + pos), *restrict dsrc = (uint32_t *) (src + pos);
for (int j = 0; j < reps; j++)
ddest[j] = dsrc[j];
printf("Iteration %d: start position %d, end position %d, size %d.\n", iter, pos, pos + (reps * divisor), divisor);
break;
case 2:
uint16_t *restrict hdest = (uint16_t *) (dest + pos), *restrict hsrc = (uint16_t *) (src + pos);
for (int j = 0; j < reps; j++)
hdest[j] = hsrc[j];
printf("Iteration %d: start position %d, end position %d, size %d.\n", iter, pos, pos + (reps * divisor), divisor);
break;
case 1:
uint8_t *restrict bdest = (uint8_t *) (dest + pos), *restrict bsrc = (uint8_t *) (src + pos);
for (int j = 0; j < reps; j++)
bdest[j] = bsrc[j];
printf("Iteration %d: start position %d, end position %d, size %d.\n", iter, pos, pos + (reps * divisor), divisor);
break;
}
pos += reps * divisor;
rem -= reps * divisor;
iter++;
}
}
}
int main() {
char *string1 = "This is an example string. Please note I have not precomputed the length.";
size_t length = strlen((const char *) string1);
char *string2 = (char *) malloc(length * sizeof(char));
memcpy(string2, string1, length);
printf("\n");
printf("STRING 1 (%d bytes): \"%s\"\n", length * sizeof(char), string1);
printf("STRING 2 (%d bytes): \"%s\"\n", strlen(string2) * sizeof(char), string2);
}
As one can see, the initial test cases work:
Iteration 0: start position 0, end position 72, size 8.
Iteration 0: start position 72, end position 73, size 1.
STRING 1 (73 bytes): "This is an example string. Please note I have not precomputed the length."
STRING 2 (73 bytes): "This is an example string. Please note I have not precomputed the length."
and
Iteration 0: start position 0, end position 24, size 8.
Iteration 0: start position 24, end position 28, size 4.
Iteration 0: start position 28, end position 29, size 1.
STRING 1 (29 bytes): "BLAH BLAH BLAH BLAH BLAH BLAH"
STRING 2 (29 bytes): "BLAH BLAH BLAH BLAH BLAH BLAH"
However, delete the period off the end of the first test case (length = 72) and now the copied string possesses 74 bytes rather than 72 in the input string. Similarly, the second test case displays the same issue: delete the last "BLAH" and now the 24-byte string results in a 26-byte copy.
How do I go about fixing this? I attempted to place iteration counts in the inside of the while() structure to see if it would point out a failure in one of the conditions, but couldn't find anything readily sticking out.
string2buffer is not\0terminated, so it's no "string"...void*return type declaration formemcpya typo forvoid? Otherwise, that function should return something. And you should tell us what compiler you are using: Standard C doesn't allow arithmetic onvoid*pointers (like indest + pos) but (IIRC) GCC does.return destat function-level scope at the end. I'm using GCC (13.2).iterdoes not count, because you reset it in each iteration.strlen(string2)will count something arbitrary as string2 is not null terminated