I have a piece of software that generates a rather large text file full of information about files in a directory. There are often several thousand files. Each one gets a set of informational entries, looking something like:
number
number
IMPORTANT NUMBER
info
info
info
info
info
These repeat. The text file will have the same eight lines for each file in the directory.
I'm supposed to sort this text file by IMPORTANT NUMBERs, the values that appear on the 3rd line, then the 3+8th line, then the 3 + 8*2 line, etc.
Currently, I'm reading them into a multidimensional character array, looking like:
[number][number][IMPORTANT NUMBER 1][info][info][info][info][info]
[number][number][IMPORTANT NUMBER 2][info][info][info][info][info]
[number][number][IMPORTANT NUMBER 3][info][info][info][info][info]
[number][number][IMPORTANT NUMBER 4][info][info][info][info][info]
etc.
The idea is to sort each set of 8 entries by the important numbers in ascending order. If, for example, my array looked like this:
[number2][number2][2][info2][info2][info2][info2][info2]
[number3][number3][3][info3][info3][info3][info3][info3]
[number1][number1][1][info1][info1][info1][info1][info1]
[number4][number4][4][info4][info4][info4][info4][info4]
After sorting, it would look like:
[number1][number1][1][info1][info1][info1][info1][info1]
[number2][number2][2][info2][info2][info2][info2][info2]
[number3][number3][3][info3][info3][info3][info3][info3]
[number4][number4][4][info4][info4][info4][info4][info4]
...with the value in arr[2] (1,2,3,4...) being used for the sorting.
The issue is that the information stored in the other columns tends to vary in size. arr[3] might have a length of 30 characters. arr[4] might have a length in excess of 5000. Doing this for a lot of data adds up quickly enough that I don't want to just allocate set sizes of whatever that max length might be, especially if I'm just going to use a tiny fraction of it at a time in most of the cases.
I've found a lot of good answers on using qsort, but very few on sorting large multidimensional string arrays. I'd also LIKE to use qsort, because I'd prefer not to reinvent the wheel and I doubt anything I'd write would be as efficient.
If anyone could shed some light on how this might be accomplished, I'd appreciate it.
Current code is:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#define FIELDS 8
int compare(const void *row1, const void *row2);
int main(int argc, char *argv[])
{
// (1) - Open File
const char fname[] = "arrayFile.txt";
FILE *fp = fopen(fname, "r");
printf("Opened file: %s\n", fname);
// (2) - Count Lines
char cr;
size_t lines = 0;
while (cr != EOF)
{
if (cr == '\n')
{
lines++;
}
cr = getc(fp);
}
rewind(fp);
// (3) - Populate Array
char *data[lines / FIELDS][FIELDS];
lines = lines / FIELDS;
size_t n;
for (int i = 0; i < lines; i++)
{
for (int j = 0; j < FIELDS; j++)
{
data[i][j] = NULL;
size_t n = 0;
getline(&data[i][j], &n, fp);
}
}
// (4) - Print Array Before
for (int i = 0; i < lines; i++)
{
for (int j = 0; j < FIELDS; j++)
{
printf("%s", data[i][j]);
}
}
printf("\n\nNot sorted\n\n");
// (5) - Sort Array
qsort(data, lines, sizeof(data[0]), compare);
printf("\n\nsorted\n\n");
// (6) - Print Array After
for (int i = 0; i < lines; i++)
{
for (int j = 0; j < FIELDS; j++)
{
printf("%s", data[i][j]);
free(data[i][j]);
}
}
// Close File
fclose(fp);
printf("\n\nNumber of files: %ld\n", lines);
printf("\n\nNumber of lines: %ld\n", lines * FIELDS);
return 0;
}
int compare(const void *row1, const void *row2)
{
const char *(*a)[8] = row1;
const char *(*b)[8] = row2;
return strcmp((*a)[2], (*b)[2]);
}
This, unfortunately (and predictably), produces a segmentation fault during sort time. I'd estimate it's due to how I'm handling the pointers and indexing, but the EXACT reason why is escaping me.
This seems like a really useful thing to know how to do well for the future, but it's a little more than I've personally tried to do with arrays and pointers in C before.
Thanks in advance.
EDIT: For interested parties, the above code, while not optimized, is at least functional. For suggestions on possible improvements, see answers here.
SECOND EDIT: As it turns out, sorting this via the system() command is also a viable option and takes up much less memory space. This can be accomplished via something like:
char cmd[50];
sprintf(cmd, "sort -o destinationFileName sourceFileName");
system(cmd);
The system can do this for you.
char cr; size_t lines = 0; while( cr != EOF )becausecris not initialized in the first iteration. Useint cr; size_t lines = 0; while ((cr = getc(fp)) != EOF) { if (cr == '\n') lines++; }. You could also have problems with premature EOF or never getting EOF because you usechar cr;but you must useint cr;to store EOF accurately. Do you need to check thatlinesis a multiple of 8 before going ahead and reading the data?compare()function to print(*a)[2]and(*b)[2](or the whole line of each array? Use it on a minimal (2-3 entry) data set and see whether you see what you expect to see.