0

I have a piece of software that generates a rather large text file full of information about files in a directory. There are often several thousand files. Each one gets a set of informational entries, looking something like:

number
number
IMPORTANT NUMBER
info
info
info
info
info

These repeat. The text file will have the same eight lines for each file in the directory.

I'm supposed to sort this text file by IMPORTANT NUMBERs, the values that appear on the 3rd line, then the 3+8th line, then the 3 + 8*2 line, etc.

Currently, I'm reading them into a multidimensional character array, looking like:

[number][number][IMPORTANT NUMBER 1][info][info][info][info][info]
[number][number][IMPORTANT NUMBER 2][info][info][info][info][info]
[number][number][IMPORTANT NUMBER 3][info][info][info][info][info]
[number][number][IMPORTANT NUMBER 4][info][info][info][info][info]

etc.

The idea is to sort each set of 8 entries by the important numbers in ascending order. If, for example, my array looked like this:

[number2][number2][2][info2][info2][info2][info2][info2]
[number3][number3][3][info3][info3][info3][info3][info3]
[number1][number1][1][info1][info1][info1][info1][info1]
[number4][number4][4][info4][info4][info4][info4][info4]

After sorting, it would look like:

[number1][number1][1][info1][info1][info1][info1][info1]
[number2][number2][2][info2][info2][info2][info2][info2]
[number3][number3][3][info3][info3][info3][info3][info3]
[number4][number4][4][info4][info4][info4][info4][info4]

...with the value in arr[2] (1,2,3,4...) being used for the sorting.

The issue is that the information stored in the other columns tends to vary in size. arr[3] might have a length of 30 characters. arr[4] might have a length in excess of 5000. Doing this for a lot of data adds up quickly enough that I don't want to just allocate set sizes of whatever that max length might be, especially if I'm just going to use a tiny fraction of it at a time in most of the cases.

I've found a lot of good answers on using qsort, but very few on sorting large multidimensional string arrays. I'd also LIKE to use qsort, because I'd prefer not to reinvent the wheel and I doubt anything I'd write would be as efficient.

If anyone could shed some light on how this might be accomplished, I'd appreciate it.

Current code is:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define FIELDS 8

int compare(const void *row1, const void *row2);

int main(int argc, char *argv[])
{
    // (1) - Open File
    const char fname[] = "arrayFile.txt";

    FILE *fp = fopen(fname, "r");

    printf("Opened file: %s\n", fname); 

    // (2) - Count Lines
    char cr;
    size_t lines = 0;
    while (cr != EOF)
    {
        if (cr == '\n') 
        {
            lines++;
        }
        cr = getc(fp);
    } 
    rewind(fp);

    // (3) - Populate Array
    char *data[lines / FIELDS][FIELDS];
    lines = lines / FIELDS;
    size_t n;

    for (int i = 0; i < lines; i++) 
    {
        for (int j = 0; j < FIELDS; j++)
        {
            data[i][j] = NULL;
            size_t n = 0;
            getline(&data[i][j], &n, fp);
        }    
    }

    // (4) - Print Array Before
    for (int i = 0; i < lines; i++) 
    {
        for (int j = 0; j < FIELDS; j++)
        {
            printf("%s", data[i][j]);
        }
    }

    printf("\n\nNot sorted\n\n");

    // (5) - Sort Array
    qsort(data, lines, sizeof(data[0]), compare);

    printf("\n\nsorted\n\n");

    // (6) - Print Array After
    for (int i = 0; i < lines; i++) 
    {
        for (int j = 0; j < FIELDS; j++)
        {
            printf("%s", data[i][j]);
            free(data[i][j]);
        }
    }

    // Close File
    fclose(fp);

    printf("\n\nNumber of files: %ld\n", lines);
    printf("\n\nNumber of lines: %ld\n", lines * FIELDS);

    return 0;
}

int compare(const void *row1, const void *row2)
{
    const char *(*a)[8] = row1;
    const char *(*b)[8] = row2;

    return strcmp((*a)[2], (*b)[2]);
}

This, unfortunately (and predictably), produces a segmentation fault during sort time. I'd estimate it's due to how I'm handling the pointers and indexing, but the EXACT reason why is escaping me.

This seems like a really useful thing to know how to do well for the future, but it's a little more than I've personally tried to do with arrays and pointers in C before.

Thanks in advance.

EDIT: For interested parties, the above code, while not optimized, is at least functional. For suggestions on possible improvements, see answers here.

SECOND EDIT: As it turns out, sorting this via the system() command is also a viable option and takes up much less memory space. This can be accomplished via something like:

char cmd[50];
sprintf(cmd, "sort -o destinationFileName sourceFileName");
system(cmd);

The system can do this for you.

6
  • 2
    You have UB in the lines char cr; size_t lines = 0; while( cr != EOF ) because cr is not initialized in the first iteration. Use int cr; size_t lines = 0; while ((cr = getc(fp)) != EOF) { if (cr == '\n') lines++; }. You could also have problems with premature EOF or never getting EOF because you use char cr; but you must use int cr; to store EOF accurately. Do you need to check that lines is a multiple of 8 before going ahead and reading the data? Commented Apr 14, 2022 at 19:42
  • 3
    Structures are better suited for this job. Commented Apr 14, 2022 at 19:43
  • 2
    Would it be reasonable to store your data differently? For example, using an array of pointer-to-struct, where each struct contains one record for your data? Commented Apr 14, 2022 at 19:43
  • 1
    Have you hacked your compare() function to print (*a)[2] and (*b)[2] (or the whole line of each array? Use it on a minimal (2-3 entry) data set and see whether you see what you expect to see. Commented Apr 14, 2022 at 19:48
  • 1
    "it's guarantee that the line count is some value of 8N" --> relying on that makes for weak code. Code defensively, not hopefully. Commented Apr 14, 2022 at 20:24

2 Answers 2

1

There are multiple problems in your code:

  • you should test for fopen potential failure to open the file.

  • char cr; should be int cr; to handle all 257 possible values returned by getc(), assuming 8-bit bytes.

  • cr is uninitialized during the first iteration of while (cr != EOF). You should write this loop as:

      int cr;
      while ((cr = getc(fp)) != EOF) {
          lines += (cr == '\n');
      }
    
  • as documented by chux, an initial pass to read the whole file is unnecessary, you should instead reallocate the array as you read the file.

  • char *data[lines / FIELDS][FIELDS]; might define an array that is too large for automatic storage, causing a stack overflow

  • The correct format specifier for size_t is %zu, not %ld. size_t is not a long and might not even have the same size or argument passing convention.

  • the compare function uses too many indirections in the type conversion. While your type conversions are probably OK except for const correctness, they are difficult to grasp for most programmers. You should use a simpler approach:

    int compare(const void *row1, const void *row2) {
         char * const *a = row1;
         char * const *b = row2;
    
         return strcmp(a[2], b[2]);
    }
    
  • note however that the above function will sort the important numbers in lexicographical order, placing 11 between 1 and 2. You probably want numeric order instead:

    int compare(const void *row1, const void *row2) {
         char * const *a = row1;
         char * const *b = row2;
         long na = strtol(a, NULL, 10);
         long nb = strtol(b, NULL, 10);
         return (na > nb) - (na < nb);
    }
    
Sign up to request clarification or add additional context in comments.

1 Comment

Appreciate the review. This was mostly a proof-of-concept; I'll make sure the final version is a little tighter.
0

Many concerns - I'll mention only one

Counting lines

Do not count lines. (Drop step 2.) Rather than make 2 passes, use 1 pass and instead adjust data as needed.

Some untested code to give OP an idea:

  char *(*data)[FIELDS] = NULL;
  size_t records_n = 0;  // Allocation total
  size_t records_i;      // Allocation used

  for (records_i = 0; records_i < SIZE_MAX; records_i++) {
    if (records_i == records_n) {
      size_t records_new_n = records_n * 2 + 1;  // Double the allocation
      char *(*newdata)[FIELDS] = realloc(data, sizeof data[0] * records_new_n);
      if (newdata == NULL) {
        free(data);
        fprintf(stderr, "Out of memory.\n");
        exit(EXIT_FAILURE);
      }
      data = newdata;
      records_n = records_new_n;
    }
    int f;
    for (f = 0; f < FIELDS; f++) {
      data[records_i][f] = NULL;
      size_t n = 0;
      if (getline(&data[records_i][f], &n, fp) == -1) {
        if (f == 0) {
          break;
        }
        fprintf(stderr, "Record ended early.\n");
        break; // Or maybe fail?
      }
      // Lop off potential '\n'
      if (n > 0 && data[records_i][f][n - 1] == '\n') {
        data[records_i][f][--n] = 0;
      }
    }
    if (f < FIELDS) {
      break;
    }
  }
  // Perhaps right-size data to records_i here?  Not shown.

  // ... Use data

  // When all done, free all lines allocated (not shown) and ...
  free(data);

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.