Pandas/Python read file with different separators

Question

I have a .txt file as follows:

columnA;columnB;columnC;columnD
2022040200000000000000000000011    8000702   79005889  SPECIAL_AGENCY

You can see that the column names are separated by a semicolon (;), but the row values have different separators. In this example, columnA has 3 spaces, columnB has 3, columnC has 2, and columnD has 7.

It is important to clarify that I need to keep the spaces, hence the “real” separator is the last space.

Considering I have a schema that tells me, for each column, the number of spaces (separators?) it has, how can I turn this into a pandas dataframe?

no, updated, and thanks for the clarification! I have a schema that tells me for each row what is the separator
– robsanna, Commented Jan 20, 2023 at 16:48
so one row can occupy multiple lines (as shown in your post)?
– RomanPerekhrest, Commented Jan 20, 2023 at 16:49
Is that line wrap in the file, or just a formatting mistake in typing the question?
– suvayu, Commented Jan 20, 2023 at 16:49
Is the LONDON line part of the previous line, or a new line? What column does it go into?
– Nick ODell, Commented Jan 20, 2023 at 16:50

Timeless · Accepted Answer · 2023-01-21 13:42:44Z

4

One way is to pass pandas.read_csv a regex separator made of two alternatives joined with (|):

import pandas as pd

# Raw string so that \d, \s and \B reach the regex engine unchanged
df = pd.read_csv("/tmp/file.txt", sep=r";|(?<=\d)\s+(?=\B)", engine="python")

Output :

print(df)

                           columnA  columnB   columnC                       columnD
0  2022040200000000000000000000011  8000702  79005889   SPECIAL_AGENCY       LONDON
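
To see what the two alternatives of that separator match, here is a small sketch that applies the same pattern with the standard re module to the header and to the data row from the question. It only illustrates the regex; how pandas' python engine treats the leftover space may differ. Because (?=\B) forces each whitespace match to end between two non-word characters, one space is left at the start of the next field, which is why the fields are stripped afterwards.

```python
import re

# Separator copied from the read_csv call, written as a raw string.
SEP = re.compile(r";|(?<=\d)\s+(?=\B)")

# Sample lines copied from the question.
header = "columnA;columnB;columnC;columnD"
row = "2022040200000000000000000000011    8000702   79005889  SPECIAL_AGENCY"

# The first alternative (;) splits the header line.
names = SEP.split(header)

# The second alternative splits on whitespace that follows a digit.
raw_fields = SEP.split(row)

# Each match stops one space short, so trim the fields for display.
fields = [field.strip() for field in raw_fields]

print(names)
print(raw_fields)
print(fields)
```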

NB: If needed, you can add pandas.Series.replace to clean up the extra whitespace (\s) in columnD.
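
A minimal sketch of that cleanup, assuming you want runs of whitespace collapsed to a single space and the ends trimmed. The one-row frame below is made up from the output above purely for illustration; skip this step entirely if the spaces must be preserved.

```python
import pandas as pd

# Hypothetical frame holding the columnD value shown in the output above.
df = pd.DataFrame({"columnD": ["  SPECIAL_AGENCY       LONDON"]})

# Collapse every run of whitespace into one space, then trim both ends.
df["columnD"] = df["columnD"].replace(r"\s+", " ", regex=True).str.strip()

print(df)
```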

edited Jan 21, 2023 at 13:42

answered Jan 20, 2023 at 17:12


2 Comments

robsanna Over a year ago

Does this solution keep the spaces as well? Or does it trim everything?

Timeless Over a year ago

If you mean the whitespace between, for example, SPECIAL_AGENCY and LONDON, then yes, it is not stripped (as you can see in the output I shared). Also, my code is based on the format of your example, so if that example does not accurately describe your actual dataset, I'm not sure you'll get the expected output.

suvayu · Accepted Answer · 2023-01-22 17:39:52Z

4

The following should work; however, it has the downside of reading the whole file into memory before creating the dataframe. That could pose a problem if your file is large.

In [15]: from pathlib import Path

In [16]: import pandas as pd

In [17]: data = Path("data.txt").read_text().splitlines()

In [18]: hdr = data[0].split(";")

In [19]: df = pd.DataFrame([row.split() for row in data[1:]], columns=hdr)

In [20]: df
Out[20]: 
                           columnA  columnB   columnC         columnD
0  2022040200000000000000000000011  8000702  79005889  SPECIAL_AGENCY
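
If the memory downside matters, one possible variation is to read only the header line by hand and let pandas.read_csv parse the remaining lines from the same file handle. This is a sketch under the same assumption as the code above, namely that no field contains whitespace that must be kept. The function name is made up for illustration, and dtype=str is used so the very long number in columnA is kept as text.

```python
import io

import pandas as pd


def read_semicolon_header_file(handle):
    """Hypothetical helper: ';'-separated header, whitespace-separated rows."""
    # Consume only the first line and split it into column names.
    header = handle.readline().rstrip("\n").split(";")
    # pandas parses the rest of the handle, splitting on runs of whitespace.
    return pd.read_csv(handle, sep=r"\s+", names=header, header=None, dtype=str)


# In-memory stand-in for the file, using the sample lines from the question.
sample = io.StringIO(
    "columnA;columnB;columnC;columnD\n"
    "2022040200000000000000000000011    8000702   79005889  SPECIAL_AGENCY\n"
)
df = read_semicolon_header_file(sample)
print(df)
```

With a real file, the same function could be called inside a `with open("data.txt") as handle:` block.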

edited Jan 22, 2023 at 17:39

answered Jan 20, 2023 at 17:09


2 Comments

robsanna Over a year ago

There is no 5th column in my example. Apologies for the typo, but I don’t see a new line in my snippet

suvayu Over a year ago

@robsanna okay, thanks for the clarification. In that case the answer by Timeless should work for you (accept it if so).
