I have a column Values that contains category labels, for example New Va, P Va, B, and so on. I need to create one column for each category, holding that category's values.

       Date  Column1 Total        Type Values
0       NaN      NaN   NaN       Type1    5.1
1       NaN  Column2   Sum       Type1 New Va
2   04/2019        2   NaN       Type1    NaN
3   05/2019        2   NaN       Type1    NaN
4   06/2019        2     2       Type1     14
5   07/2019        4     4       Type1     16
6       NaN      NaN   NaN  Unnamed: 4    NaN
7       NaN  Column2   Sum  Unnamed: 4   P Va
8   04/2019        2   NaN  Unnamed: 4    NaN
9   05/2019        2   NaN  Unnamed: 4    NaN
10  06/2019        2     2  Unnamed: 4     10
11  07/2019        4     4  Unnamed: 4     15
12      NaN      NaN   NaN  Unnamed: 5    NaN
13      NaN  Column2   Sum  Unnamed: 5      B
14  04/2019        2   NaN  Unnamed: 5    NaN
15  05/2019        2   NaN  Unnamed: 5    NaN
16  06/2019        2     2  Unnamed: 5      8
17  07/2019        4     4  Unnamed: 5      7
18      NaN      NaN   NaN       Type2    4.9

Given that the rows where Date is NaN will be removed later, the expected result is:

       Date  Column1 Total        Type Values New Va   P Va  B
0       NaN      NaN   NaN       Type1    5.1   
1       NaN  Column2   Sum       Type1      N
2   04/2019        2   NaN       Type1    NaN   0
3   05/2019        2   NaN       Type1    NaN   0
4   06/2019        2     2       Type1     14   14
5   07/2019        4     4       Type1     16   16
6       NaN      NaN   NaN  Unnamed: 4    NaN
7       NaN  Column2   Sum  Unnamed: 4      P
8   04/2019        2   NaN  Unnamed: 4    NaN       0
9   05/2019        2   NaN  Unnamed: 4    NaN       0
10  06/2019        2     2  Unnamed: 4     10       10
11  07/2019        4     4  Unnamed: 4     15       15
12      NaN      NaN   NaN  Unnamed: 5    NaN
13      NaN  Column2   Sum  Unnamed: 5      B            
14  04/2019        2   NaN  Unnamed: 5    NaN              0
15  05/2019        2   NaN  Unnamed: 5    NaN              0
16  06/2019        2     2  Unnamed: 5      8              8
17  07/2019        4     4  Unnamed: 5      7              7
18      NaN      NaN   NaN       Type2    4.9

After that, I will group by Date so that the values for New Va, P Va, and B end up in the same row. I tried to create the new columns like this:

df['New Va'] = np.where(df['Values'].str.contains('New Va'), 'N', np.nan)

However, all rows other than the P and B ones are NaN, and I don't get the numbers shown in the example above.

2 Answers

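import re  # not strictly necessary, but compiling might speed things up for lots of data

import pandas as pd

pat = re.compile(r"^[a-zA-Z\s]*$")           # raw string; cells made only of letters and whitespace
v = df.Values[df.Column1.notna()].fillna(0)  # keep rows that have Column1, treat missing values as 0
a = ~v.str.match(pat).fillna(False)          # mask of things that don't match (the numeric rows)
keys = pd.unique(v[~a])                      # unique category labels, in order of appearance
fill = dict.fromkeys(keys, '')               # '' as the fill value for each new column
d = pd.get_dummies(v.mask(a).ffill())[a]     # one indicator column per category, label carried down
new = d.mul(pd.to_numeric(v[a]), axis=0).where(d == 1, '')[keys]  # numbers only under their own category

df.join(new).fillna(fill)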

       Date  Column1 Total        Type  Values New Va P Va  B
0       NaN      NaN   NaN       Type1     5.1               
1       NaN  Column2   Sum       Type1  New Va               
2   04/2019        2   NaN       Type1     NaN      0        
3   05/2019        2   NaN       Type1     NaN      0        
4   06/2019        2     2       Type1      14     14        
5   07/2019        4     4       Type1      16     16        
6       NaN      NaN   NaN  Unnamed: 4     NaN               
7       NaN  Column2   Sum  Unnamed: 4    P Va               
8   04/2019        2   NaN  Unnamed: 4     NaN           0   
9   05/2019        2   NaN  Unnamed: 4     NaN           0   
10  06/2019        2     2  Unnamed: 4      10          10   
11  07/2019        4     4  Unnamed: 4      15          15   
12      NaN      NaN   NaN  Unnamed: 5     NaN               
13      NaN  Column2   Sum  Unnamed: 5       B               
14  04/2019        2   NaN  Unnamed: 5     NaN              0
15  05/2019        2   NaN  Unnamed: 5     NaN              0
16  06/2019        2     2  Unnamed: 5       8              8
17  07/2019        4     4  Unnamed: 5       7              7
18      NaN      NaN   NaN       Type2     4.9               
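
The answer comes without a walk-through, so here is a minimal sketch of how the core of it appears to work, on a tiny made-up Series rather than the asker's frame. The names is_value and labels are only illustrative, and dtype=int is passed to get_dummies so that the indicators are plain 0/1 integers:

```python
import pandas as pd

# Made-up sample: two category headers, each followed by its numbers.
v = pd.Series(["New Va", "14", "16", "B", "8"])

# True on rows that are NOT pure letters/whitespace, i.e. the numeric rows.
is_value = ~v.str.match(r"^[a-zA-Z\s]*$")

# Blank out the numeric rows and carry each header down its block.
labels = v.mask(is_value).ffill()

# One 0/1 indicator column per category, kept only for the numeric rows.
d = pd.get_dummies(labels, dtype=int)[is_value]

# Multiplying row-wise by the numbers leaves each number under its own category.
new = d.mul(pd.to_numeric(v[is_value]), axis=0)
print(new)
```

In the answer, the remaining zeros are then replaced with '' via where(d == 1, ''), and the result is joined back onto df by index.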

6 Comments

Nice answer @piRSquared :)
Your alternate approach is almost there. How can I check for values that are alphabetic but also contain spaces? I'm trying keys = pd.unique([s for s in s if str(s).isalpha() or str(s).isspace()]), but this is not working.
@Twwister8889 include it in your example so that I understand.
@piRSquared I edited the question and included 'New Va' as a category. The code I have now is keys = pd.unique([s for s in s if str(s).replace(' ','').isalpha()]). I don't know if it is the best way, but it works. The remaining problem is a = ~v.str.isalpha().fillna(False), because the column contains strings with spaces.
@Twwister8889 I understand now. I can include that later when I get home

Let us try:

m = df['Values'].str.contains(r'(?i)^[A-Z\s]+$', na=False)
c, b = list(df.loc[m, 'Values']), m.cumsum()

for _, v in df['Values'].groupby(b):
    if v.iat[0] in c:
        s = v.iloc[1:].fillna(0)
        df.loc[s.index, v.iat[0]] = s

df[c] = df[c].mask(df['Date'].isna()).fillna('')

Details:

Create a boolean mask with str.contains that is True where Values holds a category such as New Va, P Va, or B:

>>> m
0     False
1      True
2     False
3     False
4     False
5     False
6     False
7      True
8     False
9     False
10    False
11    False
12    False
13     True
14    False
15    False
16    False
17    False
18    False
Name: Values, dtype: bool
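
As a small self-contained illustration of that mask, here is a sketch on a made-up Series (the sample values are assumptions, not the asker's full column). The (?i) flag makes the match case-insensitive, and na=False makes missing cells count as non-categories:

```python
import pandas as pd

# Made-up sample mixing numbers, category labels, and a missing cell.
values = pd.Series(["5.1", "New Va", None, "14", "P Va", "10", "B", "7"], name="Values")

# True only for cells made up entirely of letters and whitespace.
m = values.str.contains(r"(?i)^[A-Z\s]+$", na=False)

print(values[m].tolist())
```

Anchoring the pattern with ^ and $ is what keeps a cell like "5.1" or "14" from matching.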

Identify the blocks of rows that start with a category in the Values column:

>>> b

0     0
1     1
2     1
3     1
4     1
5     1
6     1
7     2
8     2
9     2
10    2
11    2
12    2
13    3
14    3
15    3
16    3
17    3
18    3
Name: Values, dtype: int64
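
The block ids come from a cumulative sum over the mask: every True starts a new block, and rows before the first category land in block 0. A minimal sketch on made-up data (the name heads is only illustrative):

```python
import pandas as pd

# Made-up sample: one stray number, then two category blocks.
values = pd.Series(["5.1", "New Va", "14", "16", "P Va", "10"], name="Values")

m = values.str.contains(r"(?i)^[A-Z\s]+$", na=False)
b = m.cumsum()  # the running count of categories seen so far is the block id

# First element of each block; only blocks 1 and 2 start with a category.
heads = [v.iat[0] for _, v in values.groupby(b)]
print(b.tolist(), heads)
```

Block 0 starts with "5.1" rather than a category, which is why the loop in the answer checks v.iat[0] in c before using a block.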

Group the Values column on these blocks. For each group whose first element is a category, add or update that category's column in the dataframe with the values that follow the category in the block. Finally, mask the values in the newly added columns where Date is NaN:

>>> df

       Date  Column1 Total        Type  Values New Va P Va  B
0       NaN      NaN   NaN       Type1     5.1               
1       NaN  Column2   Sum       Type1  New Va               
2   04/2019        2   NaN       Type1     NaN      0        
3   05/2019        2   NaN       Type1     NaN      0        
4   06/2019        2     2       Type1      14     14        
5   07/2019        4     4       Type1      16     16        
6       NaN      NaN   NaN  Unnamed: 4     NaN               
7       NaN  Column2   Sum  Unnamed: 4    P Va               
8   04/2019        2   NaN  Unnamed: 4     NaN           0   
9   05/2019        2   NaN  Unnamed: 4     NaN           0   
10  06/2019        2     2  Unnamed: 4      10          10   
11  07/2019        4     4  Unnamed: 4      15          15   
12      NaN      NaN   NaN  Unnamed: 5     NaN               
13      NaN  Column2   Sum  Unnamed: 5       B               
14  04/2019        2   NaN  Unnamed: 5     NaN              0
15  05/2019        2   NaN  Unnamed: 5     NaN              0
16  06/2019        2     2  Unnamed: 5       8              8
17  07/2019        4     4  Unnamed: 5       7              7
18      NaN      NaN   NaN       Type2     4.9               
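
The question also mentions grouping by Date afterwards so that New Va, P Va, and B end up in the same row. A possible sketch of that follow-up on a reduced, made-up frame is below. It assumes the numbers are stored as strings and converts them with pd.to_numeric so they can be summed; the names cats, blocks, and wide are only illustrative:

```python
import numpy as np
import pandas as pd

# Reduced, made-up version of the frame: two categories, two months each.
df = pd.DataFrame({
    "Date": [np.nan, "06/2019", "07/2019", np.nan, "06/2019", "07/2019"],
    "Values": ["New Va", "14", "16", "P Va", "10", "15"],
})

m = df["Values"].str.contains(r"(?i)^[A-Z\s]+$", na=False)
cats, blocks = list(df.loc[m, "Values"]), m.cumsum()

# Same loop as in the answer, but storing numbers instead of strings.
for _, v in df["Values"].groupby(blocks):
    if v.iat[0] in cats:
        df.loc[v.index[1:], v.iat[0]] = pd.to_numeric(v.iloc[1:]).fillna(0)

# Drop the header rows, then collapse to one row per Date.
wide = df.dropna(subset=["Date"]).groupby("Date")[cats].sum()
print(wide)
```

Note that sum() skips NaN, so a Date with no entry for some category shows 0 there rather than a blank.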

1 Comment

The problem is that my data has more rows for other months for the N, P, and B categories, so the result shows only the last occurrence of these values. In my dataset, the first occurrence for N (0, 0, 14, and 16) is missing and only the last values are shown. How can I solve this? Thanks
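
It is not clear from the comment what exactly gets overwritten, so the following is only one possible approach, not a confirmed fix. It assumes a category header can appear several times, each time followed by its own months, and it replaces the per-category loop with a forward-filled label plus a pivot. The sample data and the names cat, num, long, and wide are made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up frame in which "New Va" appears twice, with different months each time.
df = pd.DataFrame({
    "Date": [np.nan, "04/2019", "05/2019", np.nan, "04/2019", np.nan, "06/2019"],
    "Values": ["New Va", None, "14", "B", "8", "New Va", "16"],
})

m = df["Values"].str.contains(r"(?i)^[A-Z\s]+$", na=False)
cat = df["Values"].where(m).ffill()                           # label carried down each block
num = pd.to_numeric(df["Values"], errors="coerce").fillna(0)  # headers and blanks become 0

# Long table of (Date, Category, Value) for the dated rows only.
long = pd.DataFrame({"Date": df["Date"], "Category": cat, "Value": num})
long = long[df["Date"].notna() & cat.notna()]

# One row per Date and one column per category; repeated blocks are summed, not overwritten.
wide = long.pivot_table(index="Date", columns="Category", values="Value",
                        aggfunc="sum", fill_value=0)
print(wide)
```

Because every dated row keeps its own (Date, Category) pair, a later block for the same category adds rows for its months instead of replacing the earlier ones.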
