pandas - write file with multiple separators. slow string concatenation

Question

I need to write files in which each line has a label, built from a series of sets joined by dots, followed by a space and a numeric value. The sets can be strings or integers, and the values can be integers or floats.

e.g.:

a.1.1 0.19

a.1.2 1.23

1.5123.29 0

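import pandas as pd

def write_myfile(df, file):
    # All columns except the last one hold the label parts.
    cols = df.columns[:-1]
    df2 = pd.DataFrame()
    # Join the non-NaN parts of each row with dots (row-wise apply, slow).
    df2['Labels'] = df[cols].apply(lambda x: '.'.join(x.dropna().astype(str).values.tolist()), axis=1)
    df2['Values'] = df['value']
    df2.to_csv(file, sep=' ', header=False, index=False)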

At the moment I use a pandas DataFrame with the labels in the first columns and the value in the final column. It works for small files but is incredibly inefficient. I need to write files with 3.5 million or so lines.

Any suggestions?

My current method to improve speed is to write the file to CSV using a "." separator, then read it back using a "," separator with the dtype specified as string, and then join the values. It's much quicker but seems a bit unreliable.
– AndyMoore, Commented Oct 20, 2016 at 11:43

jezrael · Accepted Answer · 2016-10-20 12:51:15Z

You can use a nested list comprehension, because the NaN values need to be removed.

I assume you have NaN in your data, because you use dropna.

First convert all columns except the last to a NumPy array with values, and then to a list of lists. Finally, create a new DataFrame with the constructor:

cols = df.columns[:-1]
a = pd.Series(['.'.join([str(y) for y in x if pd.notnull(y)])
               for x in df[cols].values.tolist()])
b = df['value']

df = pd.DataFrame({'Labels' : a, 'Values' : b})
print (df)
      Labels  Values
0      a.1.1    0.19
1        1.2    1.23
2      a.1.1    0.19
3        1.2    1.23
4  1.5123.29    0.00
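To get from this DataFrame to the file the question asks for, the same list comprehension can feed to_csv directly. The following is only a minimal sketch: the function name write_labeled_file and the small demo frame are made up for illustration, and it assumes, as in the question, that the last column is called 'value'.

```python
import io

import numpy as np
import pandas as pd


def write_labeled_file(df, file):
    """Write 'label value' lines; all columns except the last form the label.

    Hypothetical helper, not part of the original answer.
    """
    cols = df.columns[:-1]
    # Join the non-null parts of every row with dots, using plain Python lists.
    labels = ['.'.join([str(y) for y in x if pd.notnull(y)])
              for x in df[cols].values.tolist()]
    out = pd.DataFrame({'Labels': labels, 'Values': df['value'].values})
    # A single space separates the label from the value; no header, no index.
    out.to_csv(file, sep=' ', header=False, index=False)


# Tiny demo frame (assumed data, modelled on the question's example lines).
demo = pd.DataFrame({
    'a': ['a', np.nan, '1'],
    'b': [1, 1, 5123],
    's': [1, 2, 29],
    'value': [0.19, 1.23, 0.0]})

buf = io.StringIO()  # stands in for a real file path
write_labeled_file(demo, buf)
print(buf.getvalue())
```

Note that a float column is written as 0.0 rather than 0, so if the exact text of the values matters, they may need to be formatted separately.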

Timings:

(len(df)=50k, i.e. the 5 sample rows repeated 10000 times as in the setup code):

In [280]: %timeit (orig(df))
1 loop, best of 3: 22.2 s per loop

In [281]: %timeit (jez(df1))
10 loops, best of 3: 145 ms per loop

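import numpy as np
import pandas as pd

# Keys are listed as a, b, s, value so that 'value' is the last column;
# the functions below treat df.columns[:-1] as the label columns.
df = pd.DataFrame({
'a': {0: 'a', 1: np.nan, 2: 'a', 3: np.nan, 4: '1'},
'b': {0: 1, 1: 1, 2: 1, 3: 1, 4: 5123},
's': {0: 1, 1: 2, 2: 1, 3: 2, 4: 29},
'value': {0: 0.19, 1: 1.23, 2: 0.19, 3: 1.23, 4: 0.0}})
print (df)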

     a     b   s  value
0    a     1   1   0.19
1  NaN     1   2   1.23
2    a     1   1   0.19
3  NaN     1   2   1.23
4    1  5123  29   0.00

df = pd.concat([df]*10000).reset_index(drop=True)

df1 = df.copy()

def orig(df):
    cols = df.columns[:-1]
    df2 = pd.DataFrame()
    df2['Labels'] = df[cols].apply(lambda x: '.'.join(x.dropna().astype(str).values.tolist()), axis=1)
    df2['Values'] = df['value']

    return (df2)


def jez(df): 
    cols = df.columns[:-1]
    a = pd.Series(['.'.join([str(y) for y in x if pd.notnull(y)]) for x in df[cols].values.tolist()])
    b = df['value']
    df = pd.DataFrame({'Labels' : a, 'Values' : b})
    return (df)

print (orig(df))
print (jez(df1))
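The %timeit lines above are IPython magics. Outside IPython, a rough equivalent can be put together with the standard timeit module. This is a sketch under assumptions, not the exact benchmark from the answer: the frame is kept much smaller (an arbitrary 1000 rows) so that it finishes quickly, and the two functions are restated so the script runs on its own.

```python
import timeit

import numpy as np
import pandas as pd


def orig(df):
    # Row-wise apply: pandas builds a Series object for every row (slow).
    cols = df.columns[:-1]
    labels = df[cols].apply(
        lambda x: '.'.join(x.dropna().astype(str).values.tolist()), axis=1)
    return pd.DataFrame({'Labels': labels, 'Values': df['value']})


def jez(df):
    # Plain Python lists: no per-row pandas objects are created.
    cols = df.columns[:-1]
    labels = pd.Series(['.'.join([str(y) for y in x if pd.notnull(y)])
                        for x in df[cols].values.tolist()], index=df.index)
    return pd.DataFrame({'Labels': labels, 'Values': df['value']})


small = pd.DataFrame({
    'a': ['a', np.nan, 'a', np.nan, '1'],
    'b': [1, 1, 1, 1, 5123],
    's': [1, 2, 1, 2, 29],
    'value': [0.19, 1.23, 0.19, 1.23, 0.0]})
# 5 rows repeated 200 times; the size is an arbitrary choice for a quick run.
bench = pd.concat([small] * 200).reset_index(drop=True)

# Check that both versions build the same labels before timing them.
assert orig(bench)['Labels'].tolist() == jez(bench)['Labels'].tolist()

t_orig = timeit.timeit(lambda: orig(bench), number=3)
t_jez = timeit.timeit(lambda: jez(bench), number=3)
print('orig: %.4f s, jez: %.4f s (3 runs each)' % (t_orig, t_jez))
```

Absolute numbers will differ by machine and pandas version, so only the ratio between the two is worth comparing.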

Another, faster solution, although whether it works for you depends on your data:

Compare with str(y) != 'nan' instead of pd.notnull(y):

In [298]: %timeit (jez1(df1))
10 loops, best of 3: 114 ms per loop

def jez1(df): 
    cols = df.columns[:-1]
    a = pd.Series(['.'.join([str(y) for y in x if str(y) != 'nan']) for x in df[cols].values.tolist()])
    b = df['value']
    df = pd.DataFrame({'Labels' : a, 'Values' : b})
    return (df)
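Why this variant depends on the data can be seen by running both filters on a few hand-made rows. The helper names join_notnull and join_str_nan below are made up for this illustration; the rows are assumed edge cases, not data from the question.

```python
import numpy as np
import pandas as pd


def join_notnull(row):
    # Filter used in jez(): drops anything pandas considers missing.
    return '.'.join([str(y) for y in row if pd.notnull(y)])


def join_str_nan(row):
    # Filter used in jez1(): drops anything whose string form is 'nan'.
    return '.'.join([str(y) for y in row if str(y) != 'nan'])


# Float NaN: both filters drop it.
print(join_notnull(['a', np.nan, 1]), join_str_nan(['a', np.nan, 1]))

# None: pd.notnull drops it, the string test keeps it as the text 'None'.
print(join_notnull([None, 1, 2]), join_str_nan([None, 1, 2]))

# A literal 'nan' string: pd.notnull keeps it, the string test drops it.
print(join_notnull(['nan', 1, 2]), join_str_nan(['nan', 1, 2]))
```

So if the label columns only ever contain float NaN as the missing marker, the two versions agree; with None or a genuine 'nan' label they do not.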

Awesome, this is so much quicker. a = pd.Series(['.'.join([str(y) for y in x if pd.notnull(y)]) for x in df[cols].values.tolist()]) is way quicker than df[cols].apply(lambda x: '.'.join(x.dropna().astype(str).values.tolist()), axis=1). Many thanks.
