You're working too hard. You concatenate all of the CSV rows anyway, so LINES= has no value. Just read the CSV and choose an easy way to identify invalid rows, such as rows whose last column is NaN due to having insufficient columns:
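
df = pd.read_csv(fname)                # comment rows come through with NaNs in the trailing columns
df = df[df[df.columns[-1]].notna()]    # keep only the full-width data rows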

Since you're using Pandas, your dictionary format is probably not a good idea. Consider instead forming a second dataframe whose column names are based on the keys. The key-value separation can be done with Pandas in a vectorised manner:

Suggested

from io import StringIO
import pandas as pd


def read_mixed(fname) -> tuple[pd.DataFrame, pd.DataFrame]:
    df = pd.read_csv(fname)               # Comment and data rows are mixed in
    is_data = df[df.columns[-1]].notna()  # Assume that "short" rows are comments, "long" rows are data
    data = df[is_data]                    # All data rows in one frame

    comments = df[~is_data].copy()                 # Comment rows are "everything left"; copy so we can edit them
    comments.dropna(axis='columns', inplace=True)  # Drop the all-NaN trailing columns
    first_col = comments.columns[0]                # Name of the first column, disposable
    comments[first_col] = comments[first_col].str.removeprefix('Comment line: ')  # Drop the prefix on all rows
    comments.columns = comments.iloc[0, :].str.split('=', n=1).str[0]  # Set columns to be keys from comment cells

    for col in comments.columns:
        comments[col] = comments[col].str.split('=', n=1).str[1]

    return data, comments

# This is a tester that illustrates the format of the real file(s)
if __name__ == "__main__":

    fstr = StringIO("""a1,a2,a3,b1,b2,b3,c1,c2,c3
Comment line: LINES=5, CAR1=Top Fuel, CAR2=Funny, CAR3=Pro Stock
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
Comment line: LINES=5, CAR1=Funny, CAR2=Pro Stock, CAR3=Top Fuel
24,2,46,10,46,10,6,8,10
24,2,46,10,46,10,6,8,10
24,2,46,10,46,10,6,8,10
24,2,46,10,46,10,6,8,10
24,2,46,10,46,10,6,8,10
Comment line: LINES=3, CAR1=Funny, CAR2=Top Fuel, CAR3=Pro Stock
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
""")
    output = read_mixed(fstr)

    print(output)

Output

    a1 a2  a3  b1    b2    b3   c1   c2    c3
1   12  1  23   5  23.0   5.0  3.0  4.0   5.0
2   12  1  23   5  23.0   5.0  3.0  4.0   5.0
3   12  1  23   5  23.0   5.0  3.0  4.0   5.0
4   12  1  23   5  23.0   5.0  3.0  4.0   5.0
5   12  1  23   5  23.0   5.0  3.0  4.0   5.0
7   24  2  46  10  46.0  10.0  6.0  8.0  10.0
8   24  2  46  10  46.0  10.0  6.0  8.0  10.0
9   24  2  46  10  46.0  10.0  6.0  8.0  10.0
10  24  2  46  10  46.0  10.0  6.0  8.0  10.0
11  24  2  46  10  46.0  10.0  6.0  8.0  10.0
13  12  1  23   5  23.0   5.0  3.0  4.0   5.0
14  12  1  23   5  23.0   5.0  3.0  4.0   5.0
15  12  1  23   5  23.0   5.0  3.0  4.0   5.0

0  LINES      CAR1       CAR2       CAR3
0      5  Top Fuel      Funny  Pro Stock
6      5     Funny  Pro Stock   Top Fuel
12     3     Funny   Top Fuel  Pro Stock

This produces a result equivalent to that of your pd.concat and may be faster (test this).

Or if this is too variable to rely on, you could filter by rows whose first column is numeric.

You'll still need to find your Comment line by way of regex or similar.
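
A sketch of that variant, assuming (as in the sample above) that every data row starts with a number and every comment row's first cell carries the Comment line: prefix:

df = pd.read_csv(fname)

# Data rows: the first cell parses as a number; anything else is coerced to NaN
is_data = pd.to_numeric(df[df.columns[0]], errors='coerce').notna()
data = df[is_data]

# Comment rows: the first cell matches the comment prefix
is_comment = df[df.columns[0]].astype(str).str.match(r'Comment line:')
comments = df[is_comment]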
