I need to write files in which each line holds a label, built from a series of fields joined by dots, followed by a space and a numeric value. The fields can be strings or integers, and the values can be integers or floats.

eg:

a.1.1 0.19

a.1.2 1.23

1.5123.29 0

import pandas as pd

def write_myfile(df, file):
    # All columns except the last hold the parts of the label
    cols = df.columns[:-1]
    df2 = pd.DataFrame()
    # Join the non-null parts of each row with '.'
    df2['Labels'] = df[cols].apply(lambda x: '.'.join(x.dropna().astype(str).values.tolist()), axis=1)
    df2['Values'] = df['value']
    df2.to_csv(file, sep=' ', header=False, index=False)

So at the moment I use a pandas DataFrame with the label parts in the first columns and the value in the final column. It works for small files but is incredibly inefficient. I need to write files with 3.5 million or so lines.

Any suggestions?

  • My current method to improve speed is to write the file to CSV using a "." separator, read it back using a "," separator with the dtype specified as string, and then join the values. It's much quicker but seems a bit unreliable. Commented Oct 20, 2016 at 11:43

1 Answer

You can use a nested list comprehension, which also lets you filter out the NaN values.

I assume your label columns contain NaN, because you use dropna.

First convert all columns except the last to a NumPy array with values, and then to a list of lists. Finally, create a new DataFrame with the constructor:

cols = df.columns[:-1]
a = pd.Series(['.'.join([str(y) for y in x if pd.notnull(y)])
               for x in df[cols].values.tolist()])
b = df['value']

df = pd.DataFrame({'Labels' : a, 'Values' : b})
print (df)
      Labels  Values
0      a.1.1    0.19
1        1.2    1.23
2      a.1.1    0.19
3        1.2    1.23
4  1.5123.29    0.00
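
To get from this DataFrame back to the file format asked for in the question, the same list comprehension can feed to_csv directly. The following is only a minimal sketch: the function name write_labels and its value_col parameter are hypothetical, and it assumes that no label contains a space (otherwise to_csv would quote it).

```python
import io

import numpy as np
import pandas as pd


def write_labels(df, file, value_col="value"):
    """Write one 'label value' line per row (hypothetical helper)."""
    # Every column except the value column contributes to the label
    cols = [c for c in df.columns if c != value_col]
    # Join the non-null parts of each row with '.'
    labels = ['.'.join([str(y) for y in row if pd.notnull(y)])
              for row in df[cols].values.tolist()]
    out = pd.DataFrame({'Labels': labels, 'Values': df[value_col].values})
    # file may be a path or any open text buffer
    out.to_csv(file, sep=' ', header=False, index=False)


# Small sample in the shape used in this answer
df = pd.DataFrame({'a': ['a', np.nan, '1'],
                   'b': [1, 1, 5123],
                   's': [1, 2, 29],
                   'value': [0.19, 1.23, 0.0]})

buf = io.StringIO()  # stands in for a real file path
write_labels(df, buf)
lines = buf.getvalue().splitlines()
print(lines)
```

Note that a float value column is written as 0.0 rather than 0; if the integer form matters, the values would need to be formatted separately before writing.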

Timings:

(len(df)=50k):

In [280]: %timeit (orig(df))
1 loop, best of 3: 22.2 s per loop

In [281]: %timeit (jez(df1))
10 loops, best of 3: 145 ms per loop

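import numpy as np
import pandas as pd

# columns= fixes the column order so that 'value' is the last column
df = pd.DataFrame({
'value': {0: 0.19, 1: 1.23, 2: 0.19, 3: 1.23, 4: 0.0}, 
's': {0: 1, 1: 2, 2: 1, 3: 2, 4: 29}, 
'b': {0: 1, 1: 1, 2: 1, 3: 1, 4: 5123}, 
'a': {0: 'a', 1: np.nan, 2: 'a', 3: np.nan, 4: '1'}},
columns=['a', 'b', 's', 'value'])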
print (df)

     a     b   s  value
0    a     1   1   0.19
1  NaN     1   2   1.23
2    a     1   1   0.19
3  NaN     1   2   1.23
4    1  5123  29   0.00

df = pd.concat([df]*10000).reset_index(drop=True)

df1 = df.copy()

def orig(df):
    cols = df.columns[:-1]
    df2 = pd.DataFrame()
    df2['Labels'] = df[cols].apply(lambda x: '.'.join(x.dropna().astype(str).values.tolist()), axis=1)
    df2['Values'] = df['value']

    return (df2)


def jez(df): 
    cols = df.columns[:-1]
    a = pd.Series(['.'.join([str(y) for y in x if pd.notnull(y)]) for x in df[cols].values.tolist()])
    b = df['value']
    df = pd.DataFrame({'Labels' : a, 'Values' : b})
    return (df)

print (orig(df))
print (jez(df1))

Another, faster solution, though whether it works well depends on your data:

Compare with str(y) != 'nan' instead of pd.notnull(y):

In [298]: %timeit (jez1(df1))
10 loops, best of 3: 114 ms per loop

def jez1(df): 
    cols = df.columns[:-1]
    a = pd.Series(['.'.join([str(y) for y in x if str(y) != 'nan']) for x in df[cols].values.tolist()])
    b = df['value']
    df = pd.DataFrame({'Labels' : a, 'Values' : b})
    return (df)
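
Why the string comparison depends on the data can be seen on a single row. The sketch below uses a made-up row (not taken from the question) that contains a literal 'nan' string, None, NaN and NaT, and applies both filters to it.

```python
import numpy as np
import pandas as pd

# Hypothetical row mixing a literal 'nan' label with several missing-value markers
row = ['nan', None, np.nan, pd.NaT, 1]

# pd.notnull treats None, NaN and NaT as missing, and keeps the string 'nan'
by_notnull = [str(y) for y in row if pd.notnull(y)]

# The string test drops anything whose str() is 'nan', including a real
# label 'nan', but lets None and NaT through as 'None' and 'NaT'
by_string = [str(y) for y in row if str(y) != 'nan']

print(by_notnull)
print(by_string)
```

So the faster variant is probably only safe when the label columns never contain the text 'nan' and the only missing-value marker is a float NaN.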

1 Comment

Awesome, this is so much quicker. a = pd.Series(['.'.join([str(y) for y in x if pd.notnull(y)]) for x in df[cols].values.tolist()]) is way quicker than df[cols].apply(lambda x: '.'.join(x.dropna().astype(str).values.tolist()), axis=1). Many thanks.
