1

I am trying to import a txt file into a pandas DataFrame. The file looks like this:

12345           20191113418824004           S20191013
23456           20191030T20.60XA            X20191230

The resulting DataFrame should look like this:

memberid    Date1      Code        Flag   Date2
12345       20191113   418824004   S      20191013
23456       20191030   T20.60XA    X      20191230

So far I have tried:

data = pd.read_csv ("diag.txt",delimiter = "\t")
df = pd.DataFrame(data, columns= ['memberid','Date1','Code','Flag','Date2'])

However, every column comes back as NaN. I am not sure why even the memberid column is not being picked up. Any guidance is much appreciated.

Here are the rules for separation:

  1. Take the first row: 12345 20191113418824004 S20191013. The first continuous run of digits (until we hit the first space), 12345, is the memberid.
  2. In the next chunk (20191113418824004), the first 8 characters become Date1 and whatever is left becomes the Code. In this case 20191113 is the date and the rest, 418824004, is the code.
  3. The next chunk is S20191013. The first letter becomes the Flag and the rest becomes Date2. This third "column", if I may call it that, is always varchar(9). So in this case S is the flag and 20191013 is Date2.

P.S This is all random mock data that I manually generated. No sensitive information.

2
  • 1
    I don't know if there's a copy-and-paste issue with the txt file, but are the values actually stuck together like that in the file, or are there tabs present (within S20191013, for example)? If the values really are stuck together like that, can you outline the rules for how they should be separated into columns? Commented Jun 6, 2021 at 18:01
  • 1
    Hello, it's not a copy-paste issue. I have updated my question with the rules for how the values should be separated. I'd be highly obliged if you could take a look. Thanks for your input. Commented Jun 6, 2021 at 19:15

2 Answers

2

Try:

df = pd.read_csv("your_file.txt", sep=r"\s+", header=None)
df[["Date1", "Code"]] = df.pop(1).str.extract(r"(\d{8})(.*)", expand=True)
df[["Flag", "Date2"]] = df.pop(2).str.extract(r"([A-Z])(.*)", expand=True)
df = df.rename(columns={0: "memberid"})
print(df)

Prints:

   memberid     Date1       Code Flag     Date2
0     12345  20191113  418824004    S  20191013
1     23456  20191030   T20.60XA    X  20191230
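
As for the NaN columns in the question: assuming the file contains no tab characters, read_csv with delimiter="\t" sees one wide column and uses the first line as its header, and pd.DataFrame(data, columns=[...]) then selects by labels that do not exist. The sketch below reproduces that on an in-memory copy of the two sample rows and then applies the whitespace split. The dtype=str argument and the anchored patterns are my additions, not part of the answer above; dtype=str keeps every field as text, so .str.extract still works if every Code in a file happens to be numeric.

```python
import io
import pandas as pd

# In-memory copy of the two sample rows from the question (no tabs, only spaces).
text = (
    "12345           20191113418824004           S20191013\n"
    "23456           20191030T20.60XA            X20191230\n"
)
names = ["memberid", "Date1", "Code", "Flag", "Date2"]

# Reproduce the problem: with a tab delimiter there is only one column,
# and its label is the entire first line of the file.
bad = pd.read_csv(io.StringIO(text), delimiter="\t")
# Passing columns= to the constructor selects by label; none of these labels exist.
selected = pd.DataFrame(bad, columns=names)

# Split on runs of whitespace instead and keep every field as a string.
raw = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None, dtype=str)

df = pd.DataFrame({"memberid": raw[0]})
# Rule 2: first 8 digits are Date1, the remainder is Code.
df[["Date1", "Code"]] = raw[1].str.extract(r"^(\d{8})(.*)$")
# Rule 3: first character is Flag; assumes Date2 is always exactly 8 digits.
df[["Flag", "Date2"]] = raw[2].str.extract(r"^(.)(\d{8})$")

print(selected)
print(df)
```

Rows that do not match a pattern come back as NaN in the extracted columns, which makes malformed lines easy to spot.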

0

If you have fixed-width data, you may want to use pandas.read_fwf instead of read_csv. That way the column layout is specified when reading and nothing needs to be parsed afterwards. It would look like the following (widths is a list of the individual column widths, not a cumulative sum, so be careful with those):

In [1]: pd.read_fwf("test.txt", widths=[16, 8, 20, 1, 30], header=None)
Out[1]:
       0         1          2  3         4
0  12345  20191113  418824004  S  20191013
1  23456  20191030   T20.60XA  X  20191230

If you want column names, you can pass them via the names parameter:

In [34]: pd.read_fwf("test.txt", widths=[16, 8, 20, 1, 30], header=None, names=['memberid','Date1','Code','Flag','Date2'])
Out[34]:
   memberid     Date1       Code Flag     Date2
0     12345  20191113  418824004    S  20191013
1     23456  20191030   T20.60XA    X  20191230
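
For a version that runs without a file, here is a minimal sketch. It rebuilds the two sample rows with explicit padding so that they match the widths above (I am assuming the real file is padded the same way on every line), reads everything as text with dtype=str, and then converts the two date columns with pd.to_datetime. The dtype=str argument, the final width of 8 and the date conversion are my additions for illustration.

```python
import io
import pandas as pd

# Sample values from the question, padded to the assumed layout:
# memberid in 16 chars, Date1 in 8, Code in 20, Flag in 1, Date2 in 8.
rows = [
    ("12345", "20191113", "418824004", "S", "20191013"),
    ("23456", "20191030", "T20.60XA", "X", "20191230"),
]
text = "".join(
    f"{member:<16}{date1}{code:<20}{flag}{date2}\n"
    for member, date1, code, flag, date2 in rows
)

names = ["memberid", "Date1", "Code", "Flag", "Date2"]
fwf = pd.read_fwf(
    io.StringIO(text),
    widths=[16, 8, 20, 1, 8],
    header=None,
    names=names,
    dtype=str,  # keep IDs and codes as text so nothing is coerced to a number
)

# Parse the YYYYMMDD strings into real datetimes.
for col in ("Date1", "Date2"):
    fwf[col] = pd.to_datetime(fwf[col], format="%Y%m%d")

print(fwf)
print(fwf.dtypes)
```

Note that read_fwf only gives correct results if every line follows the same character positions. If the padding varies between lines, splitting on whitespace is the safer route.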
