I have a .txt file as follows:

columnA;columnB;columnC;columnD
2022040200000000000000000000011    8000702   79005889  SPECIAL_AGENCY

Notice that the column names are separated by a semicolon (;), but the row values are separated by runs of spaces whose length differs from column to column. In this example, columnA has 3 spaces, columnB has 3, columnC has 2, and columnD has 7.

To be clear, I need to keep the spaces, so the “real” separator is the last space.

Given that I have a schema that tells me, for each column, how many spaces (separators?) it has, how can I load this file into a pandas DataFrame?

  • do all rows have fixed space gaps between columns? Commented Jan 20, 2023 at 16:45
  • no, updated, and thanks for the clarification! I have a schema that tells me for each row what is the separator Commented Jan 20, 2023 at 16:48
  • so one row can occupy multiple lines (as shown in your post)? Commented Jan 20, 2023 at 16:49
  • Is that line wrap in the file, or just a formatting mistake in typing the question? Commented Jan 20, 2023 at 16:49
  • Is the LONDON line part of the previous line, or a new line? What column does it go into? Commented Jan 20, 2023 at 16:50

2 Answers

4

One way is to pass pandas.read_csv a regex separator with two alternatives joined by |, one for the header and one for the rows:

import pandas as pd

df = pd.read_csv("/tmp/file.txt", sep=r";|(?<=\d)\s+(?=\B)", engine="python")

Output:

print(df)
                           columnA  columnB   columnC                       columnD
0  2022040200000000000000000000011  8000702  79005889   SPECIAL_AGENCY       LONDON
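To see what the separator pattern does on its own, here is a minimal sketch that applies the same regex with re.split to a header line and a data row. The gap widths (4, 3, and 2 spaces) are illustrative assumptions, not taken from the real file.

```python
import re

# Same pattern as the `sep` argument, written as a raw string.
# ";"                  splits the header line.
# (?<=\d)\s+(?=\B)     matches a run of whitespace that follows a digit,
#                      stopping one character short of the next value.
SEP = re.compile(r";|(?<=\d)\s+(?=\B)")

header = "columnA;columnB;columnC;columnD"

# Illustrative gaps of 4, 3 and 2 spaces (assumed widths).
row = (
    "2022040200000000000000000000011"
    + " " * 4 + "8000702"
    + " " * 3 + "79005889"
    + " " * 2 + "SPECIAL_AGENCY"
)

header_fields = SEP.split(header)
row_fields = SEP.split(row)

print(header_fields)
print(row_fields)
```

Because of the (?=\B) lookahead, the match always leaves the last space of each gap in place, so every field after the first starts with one space. A gap of exactly one space is not matched at all, so the pattern relies on at least two spaces between values.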

NB: If needed, you can use pandas.Series.replace with regex=True to clean up the extra whitespace (\s) in columnD.
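A minimal sketch of that cleanup, assuming you do not need the original spacing inside columnD. The one-row frame below is a hypothetical stand-in for the parsed result.

```python
import pandas as pd

# Hypothetical stand-in for the parsed DataFrame.
df = pd.DataFrame({"columnD": [" SPECIAL_AGENCY       LONDON"]})

# Collapse each run of whitespace into a single space, then trim both ends.
df["columnD"] = df["columnD"].replace(r"\s+", " ", regex=True).str.strip()

print(df["columnD"].iloc[0])
```

Skip this step if the spaces are meaningful in your data, since it discards the original widths.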

2 Comments

Does this solution keep the spaces as well? Or does it trim everything?
If you mean the whitespace between, for example, SPECIAL_AGENCY and LONDON, then yes, it is not stripped (as you can see in the output I shared). Also, my code is based on the format of your example, so if the example does not match your actual dataset, I'm not sure you'll get the expected output.
4

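The following should work. Its downside is that it reads the whole file into memory before creating the dataframe, which could be a problem if your file is large. Note also that row.split() splits on any run of whitespace, so the spacing between values is discarded.

In [15]: from pathlib import Path

In [16]: import pandas as pd

In [17]: data = Path("data.txt").read_text().splitlines()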

In [18]: hdr = data[0].split(";")

In [19]: df = pd.DataFrame([row.split() for row in data[1:]], columns=hdr)

In [20]: df
Out[20]: 
                           columnA  columnB   columnC         columnD
0  2022040200000000000000000000011  8000702  79005889  SPECIAL_AGENCY
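If holding the raw text in memory is a concern, one possible variation is to read only the header line by hand and let pandas.read_csv parse the rest. This is a sketch under assumptions: the file name and sample content are made up for the demo, and dtype=str is used to keep every value as a string, as row.split() does.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Sample content mirroring the question (assumed layout).
sample = (
    "columnA;columnB;columnC;columnD\n"
    "2022040200000000000000000000011    8000702   79005889  SPECIAL_AGENCY\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.txt"  # hypothetical file name
    path.write_text(sample)

    # Read only the first line to get the semicolon-separated header.
    with path.open() as fh:
        hdr = fh.readline().rstrip("\n").split(";")

    # Skip the header line and split the remaining rows on runs of whitespace.
    df = pd.read_csv(
        path, sep=r"\s+", skiprows=1, header=None, names=hdr, dtype=str
    )

print(df)
```

Like the row.split() version, this drops the spacing between values and would split any value that itself contains a space.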

2 Comments

There is no 5th column in my example. Apologies for the typo, but I don’t see a new line in my snippet
@robsanna okay, thanks for the clarification. In that case the answer by Timeless should work for you (accept it if so).
