5

I have a data frame that looks like this:

P Q L
1 2 3
2 3 
4 5 6,7

The objective is to check whether there is any value in L and, if so, extract the values of the P and L columns:

P L
1 3
4 6
4 7

Note that there might be more than one value in L; in that case, I need one row for each value.

Below is my current script; it does not generate the expected result.

df2 = []
ego
other
newrow = []

for item in data_DF.iterrows():
    if item[1]["L"] is not None:
        ego = item[1]['P']
        other = item[1]['L']
        newrow = ego + other + "\n"
        df2.append(newrow)

data_DF2 = pd.DataFrame(df2)
  • Are your values in L lists, a string of numbers, etc.? Can you post raw input data and code to reproduce your df? Commented Dec 3, 2015 at 11:10
  • You need to post the raw data, so we can see if the values in L are '' (empty string), NaN, or something else. And did they come in from pd.read_csv(), and if so, which dtypes and arguments were specified? You can tell read_csv how you want it to handle NaNs, and you can define '' as a NaN value. So you can prevent this issue from ever arising. Commented Jan 21, 2022 at 13:45
  • This is avoidable, and a possible non-issue. You're probably creating the issue yourself, possibly with pd.read_csv(). You haven't given enough detail to tell. Commented Jan 21, 2022 at 15:23

3 Answers

2

First, you can extract all rows of the L and P columns where L is not missing like so:

df2 = df[~pd.isnull(df.L)].loc[:, ['P', 'L']].set_index('P')

Next, you can deal with the multiple values in some of the remaining L rows as follows:

df2 = df2.L.str.split(',', expand=True).stack()
df2 = df2.reset_index().drop('level_1', axis=1).rename(columns={0: 'L'}).dropna()
df2.L = df2.L.str.strip()

To explain: with P as index, the code splits the string content of the L column on ',' and distributes the individual elements across various columns. It then stacks the various new columns into a single new column, and cleans up the result.
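As an alternative sketch, pandas 0.25 and later provide DataFrame.explode, which can stand in for the split-and-stack steps. This is a different technique from the code above, not a restatement of it. The sample frame below is an assumption: it mirrors the question's data and treats the missing L value as None/NaN rather than an empty string.

```python
import pandas as pd

# Sample frame mirroring the question; None marks the missing L value (an assumption).
df = pd.DataFrame({"P": [1, 2, 4], "Q": [2, 3, 5], "L": ["3", None, "6,7"]})

result = (
    df.dropna(subset=["L"])                       # keep only rows where L has a value
      .assign(L=lambda d: d["L"].str.split(","))  # "6,7" becomes ["6", "7"]
      .explode("L")                               # one row per list element
      .loc[:, ["P", "L"]]                         # keep only the requested columns
      .reset_index(drop=True)
)
result["L"] = result["L"].str.strip()             # remove stray whitespace around values
print(result)
```

If the empty cells are empty strings instead of NaN, they would need to be converted first (for example with replace('', pd.NA)) before dropna can filter them out.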


1

First, I split the multiple values of column L into a new Series s, whose index repeats the original index once per value. Then I drop the now-unnecessary columns L and Q from df, join s back onto df, and drop the rows with NaN values.

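import pandas as pd

print(df)
   P  Q    L
0  1  2    3
1  2  3  NaN
2  4  5  6,7

# split each L string on ',' and stack the pieces into one long Series
s = df['L'].str.split(',').apply(pd.Series).stack()
s.index = s.index.droplevel(-1) # to line up with df's index
s.name = 'L'
print(s)
0    3
2    6
2    7
Name: L, dtype: object

# drop the old columns, then join the split values back on the index
df = df.drop(['L', 'Q'], axis=1)
df = df.join(s)
print(df)
   P    L
0  1    3
1  2  NaN
2  4    6
2  4    7
# remove rows that had no value in L and renumber the index
df = df.dropna().reset_index(drop=True)
print(df)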
   P  L
0  1  3
1  4  6
2  4  7


0

I was solving a similar issue when I needed to create a new dataframe as a subset of a larger dataframe. Here's how I went about generating the second dataframe:

import pandas as pd

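# DataFrame.append was removed in pandas 2.0, so collect the matching rows first
rows = []
for i, row in df1.iterrows():
    if row['company_id'] == 12345 or row['company_id'] == 56789:
        rows.append(row)
# build the second dataframe from the collected rows
df2 = pd.DataFrame(rows).reset_index(drop=True)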
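For this kind of row filter, a boolean mask is usually shorter and faster than iterating. The following is a minimal sketch under stated assumptions: the frame df1 and its revenue column are made up for illustration, and only the company_id column name and the two IDs come from the answer above.

```python
import pandas as pd

# Hypothetical data; only the company_id column and the two IDs come from the answer.
df1 = pd.DataFrame({"company_id": [12345, 11111, 56789], "revenue": [10, 20, 30]})

# Boolean mask: True where company_id is one of the wanted values.
mask = df1["company_id"].isin([12345, 56789])

# Select the matching rows and renumber the index from 0.
df2 = df1[mask].reset_index(drop=True)
print(df2)
```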
