0

Hi, I have a DataFrame df with headers like this:

DATE    COL1    COL2   ...    COL10
date1    a       b      
...     ...     ...            ...

and so on        

Basically, each row has a date followed by several columns, each of which either contains some text or is blank.

From this I want to create a new DataFrame, df2, that has one row for each non-blank cell in the original DataFrame, consisting of the date and the text from that cell. From the example above we would get:

df2=

DATE    COL
date1    a
date1    b

In pseudocode what I want to achieve is:

df2 = blank df
for row in df:
    for column in row:
        if cell is not empty:
            append to df2 a row consisting of the date for that row and the value in that cell

So far I have

import pandas as pd
df = pd.read_csv("data2.csv")

output_df = pd.DataFrame(columns=['Date', 'Col'])

So far I have read in df and created the new DataFrame to start populating.

Now I am stuck. From what I have read, I should not use iterrows(), as it is inefficient and considered bad practice, and df has 300k+ rows.

Any suggestions on how I can do this, please?

3 Answers

1

Use df.melt:

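import pandas as pd

# Build a small example frame: 5 dates, each with columns col0..col2 holding 'a', 'b', 'c'
data = [{'date': f'date{j}', **{f"col{i}": val for i, val in enumerate('abc')}} for j in range(5)]

df = pd.DataFrame(data)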

    date col0 col1 col2
0  date0    a    b    c
1  date1    a    b    c
2  date2    a    b    c
3  date3    a    b    c
4  date4    a    b    c


df2 = df.melt(
    id_vars=['date'], 
    value_vars=df.filter(like='col').columns, 
    value_name='Col'
)[['date', 'Col']]


# to get the ordering the way you want (a stable sort keeps the column order within each date)
df2 = df2.sort_values(by='date', kind='stable').reset_index(drop=True)
     date Col
0   date0   a
1   date0   b
2   date0   c
3   date1   a
4   date1   b
5   date1   c
6   date2   a
7   date2   b
8   date2   c
9   date3   a
10  date3   b
11  date3   c
12  date4   a
13  date4   b
14  date4   c

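Then you can filter out the blank cells from Col. Blank fields read by pd.read_csv arrive as NaN, which is truthy, so test for NaN and the empty string explicitly rather than relying on bool:

df2 = df2[df2['Col'].notna() & (df2['Col'] != '')]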
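For reference, here is a minimal end-to-end sketch of the melt approach on data that actually contains blanks. The sample frame and the column names DATE, COL1..COL3 are made up for illustration, and it assumes blank cells are NaN (as they would be after pd.read_csv on empty fields):

```python
import pandas as pd

# Hypothetical sample data; None marks a blank cell, standing in for
# what pd.read_csv produces for an empty field.
df = pd.DataFrame({
    "DATE": ["date1", "date2", "date3"],
    "COL1": ["a", "c", None],
    "COL2": ["b", None, None],
    "COL3": [None, "d", None],
})

df2 = (
    df.melt(id_vars="DATE", value_name="COL")  # one row per (DATE, original column) pair
      .dropna(subset=["COL"])                  # drop the blank cells
      .drop(columns="variable")                # the original column name is not needed
      .sort_values("DATE", kind="stable")      # stable sort keeps COL1, COL2, ... order within a date
      .reset_index(drop=True)
)
print(df2)
```

A date whose columns are all blank (date3 here) simply contributes no rows. Everything is vectorised, so it should cope with a few hundred thousand rows without any explicit loop.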

1

You need to turn the blank cells into NA first,

i.e.

import numpy as np

df[df == ''] = np.nan

df2 = df.melt('DATE').dropna()
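If some "blank" cells hold whitespace rather than an empty string, an exact comparison with '' will miss them. A rough sketch of one way to handle that, assuming whitespace-only cells should also count as blank (that is my assumption, not something stated in the question), with made-up sample data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one empty string and one whitespace-only cell.
df = pd.DataFrame({
    "DATE": ["date1", "date2"],
    "COL1": ["a", ""],
    "COL2": ["  ", "b"],
})

# Regex replace: any cell that is empty or whitespace-only becomes NaN.
cleaned = df.replace(r"^\s*$", np.nan, regex=True)

# Melt to long form, drop the blanks, and keep only the two wanted columns.
df2 = cleaned.melt("DATE", value_name="COL").dropna(subset=["COL"])[["DATE", "COL"]]
print(df2)
```

If the frame comes straight from pd.read_csv, truly empty fields are already NaN, so the replace step only matters for cells that contain spaces.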


0

You can collect the non-null values of each row's columns into one list:

s = df.filter(like='COL').apply(lambda row: row[row.notna()].tolist(), axis=1)

Then explode on that list

df_ = pd.DataFrame({'DATE':df['DATE'], 'COL': s})
df_ = df_.explode('COL')
print(df_)

    DATE COL
0  date1   a
0  date1   b
1  date2   c
1  date2   d
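One thing to watch with this approach: a row with no values at all gives an empty list, and explode turns an empty list into a NaN row rather than dropping it. A self-contained sketch with made-up sample data (date3 is entirely blank) that adds a dropna for that case:

```python
import pandas as pd

# Hypothetical sample data; date3 has no values in any column.
df = pd.DataFrame({
    "DATE": ["date1", "date2", "date3"],
    "COL1": ["a", "c", None],
    "COL2": ["b", "d", None],
})

# Collect the non-null cells of each row into a list.
# This is a row-wise apply, so expect it to be slower than melt on large frames.
s = df.filter(like="COL").apply(lambda row: row[row.notna()].tolist(), axis=1)

df_ = (
    pd.DataFrame({"DATE": df["DATE"], "COL": s})
      .explode("COL")            # one row per list element; empty lists become NaN
      .dropna(subset=["COL"])    # remove the NaN rows left by empty lists
      .reset_index(drop=True)
)
print(df_)
```

With 300k+ rows the row-wise lambda is likely the slow part, so this is probably best kept for cases where you need the per-row lists anyway.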
