
I am struggling to convert a comma-separated list into a multi-column (7) DataFrame.

print(type(mylist))

<type 'list'>

print(mylist)


['AN,2__AAS000,26,20150826113000,-283.000,20150826120000,-283.000',         'AN,2__AE000,26,20150826113000,0.000,20150826120000,0.000',.........

The following creates a DataFrame with a single column:

df = pd.DataFrame(mylist)

I have reviewed the built-in CSV functionality in Pandas; however, my CSV data is held in a list. How can I simply convert the list into a 7-column DataFrame?

Thanks in advance.

  • I can't reproduce your error: l=[['AA','2__000',26,20150826113000,-283.000,20150826120000,-283.000],['BB','2__DI9',26,20150826113000,0.000,20150826120000,0.000],['CC','2__GH6',26,20150826113000,-269.000,20150826120000,-269.000]] pd.DataFrame(l) works fine. Commented Aug 26, 2015 at 10:42
  • Can you post the output from print(mylist)? Commented Aug 26, 2015 at 10:42
  • I have limited the results above as there are 2k rows. The DataFrame is created, however when I print(df) I get all the data followed by [1922 rows x 1 columns] Commented Aug 26, 2015 at 10:47
  • Can you post just the first few rows then? You have to show how the data is stored in your list so we can reproduce your error. Commented Aug 26, 2015 at 10:49
  • As further background, the data is originally from a file which had a mixture of CSV data and some metadata, which I stripped out before passing the CSV rows to a list. Commented Aug 26, 2015 at 10:50

3 Answers


You need to split each string in your list:

import pandas as pd

df = pd.DataFrame([sub.split(",") for sub in l])
print(df)

Output:

   0         1   2               3         4               5         6
0  AN  2__AS000  26  20150826113000  -283.000  20150826120000  -283.000
1  AN   2__A000  26  20150826113000     0.000  20150826120000     0.000
2  AN  2__AE000  26  20150826113000  -269.000  20150826120000  -269.000
3  AN  2__AE000  26  20150826113000  -255.000  20150826120000  -255.000
4  AN   2__AE00  26  20150826113000  -254.000  20150826120000  -254.000
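As a possible follow-up (not part of the original answer): after the split every column still holds strings, so you may also want to name the columns and convert the numeric ones. The column names below are only placeholders for illustration:

# hypothetical column names; replace them with whatever the fields really mean
df.columns = ["code", "id", "num", "start_time", "start_value", "end_time", "end_value"]

# split() leaves everything as strings, so convert the numeric columns explicitly
for col in ["num", "start_value", "end_value"]:
    df[col] = pd.to_numeric(df[col])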

If you know how many lines to skip in your csv, you can do it all with read_csv using skiprows=lines_of_metadata:

import pandas as pd

df = pd.read_csv("in.csv", skiprows=3, header=None)
print(df)

Or, if each line of the metadata starts with a certain character, you can use comment:

df = pd.read_csv("in.csv",header=None,comment="#")  

If you need to specify more than one character, you can use itertools.dropwhile, which will drop the leading lines starting with that prefix:

import pandas as pd
from itertools import dropwhile
import csv

with open("in.csv") as f:
    # skip the leading metadata lines that start with "#!!"
    f = dropwhile(lambda x: x.startswith("#!!"), f)
    r = csv.reader(f)
    df = pd.DataFrame.from_records(r)

Using your input data, with some lines starting with #!! added:

#!! various
#!! metadata
#!! lines
AN,2__AS000,26,20150826113000,-283.000,20150826120000,-283.000
AN,2__A000,26,20150826113000,0.000,20150826120000,0.000
AN,2__AE000,26,20150826113000,-269.000,20150826120000,-269.000
AN,2__AE000,26,20150826113000,-255.000,20150826120000,-255.000
AN,2__AE00,26,20150826113000,-254.000,20150826120000,-254.000

Outputs:

    0         1   2               3         4               5         6
0  AN  2__AS000  26  20150826113000  -283.000  20150826120000  -283.000
1  AN   2__A000  26  20150826113000     0.000  20150826120000     0.000
2  AN  2__AE000  26  20150826113000  -269.000  20150826120000  -269.000
3  AN  2__AE000  26  20150826113000  -255.000  20150826120000  -255.000
4  AN   2__AE00  26  20150826113000  -254.000  20150826120000  -254.000

7 Comments

Great work, appreciate the help, this worked perfectly. I'm very happy.
@user636322, no worries, I added a couple of ways to do it with read_csv. What does the metadata actually look like? Do you know how many lines there are, or do the lines start with a common character?
The metadata is basically repeating header information throughout the csv file. I'm not able to predict the location, so I just used a loop to remove it specifically (if row.startswith('xxx')).
@user636322, you can still do it when reading from the csv. What is the xxx in startswith('xxx')?
I'm actually selecting the valid data with the loop, and thereby eliminating the invalid data; in the example above, row.startswith('AN').
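For reference, here is a minimal sketch of the approach described in the comments above: select only the rows that start with the known data prefix ('AN') while reading the file. The file name and everything apart from that prefix are assumptions:

import csv
import pandas as pd

with open("in.csv") as f:
    # keep only the data rows; anything not starting with "AN" is treated as metadata
    valid = (line for line in f if line.startswith("AN"))
    df = pd.DataFrame.from_records(csv.reader(valid))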

You can convert the list into a 7-column DataFrame in the following way:

import pandas as pd

df = pd.read_csv(filename, sep=',')
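As an addition (not part of this answer): the question's data is already a list of comma-separated strings rather than a file, so one way to still use read_csv is to join the list and wrap it in io.StringIO, roughly like this:

import io
import pandas as pd

# mylist holds strings such as 'AN,2__AAS000,26,...'
df = pd.read_csv(io.StringIO("\n".join(mylist)), header=None)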

1 Comment

Try to add some description with your code. What does it do? Why does it work?

I encountered a similar problem and solved it this way.

import pandas as pd

def lrsplit(line):
    # keep the first and last '-'-separated fields and rejoin
    # everything in between as the middle field
    left, *mid, right = line.split('-')
    return left, '-'.join(mid), right.strip()

example = pd.DataFrame(lrsplit(line) for line in open("example.csv"))
example.columns = ['location', 'position', 'company']
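For reference, the input file example.csv would presumably contain '-'-separated lines, reconstructed here from the result below:

india-manager-intel
india-sales-manager-amazon
banglore-ccm- head - county-jp morgan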

Result:

   location             position    company
0     india              manager      intel
1     india        sales-manager     amazon
2  banglore  ccm- head - county   jp morgan

