Let's say I have a dataframe that looks like this:

REFERENCE_CODE
dog
1
2
3
4
cat
1
2

4
5

rat

3
4
5

fish
4
5
6

Notice the blank rows. I would like to end up with a dataframe that looks like this:

REFERENCE_CODE
dog
dog_1
dog_2
dog_3
dog_4
cat
cat_1
cat_2

cat_4
cat_5

rat

rat_3
rat_4
rat_5

fish
fish_4
fish_5
fish_6

I have tried something similar to the following:

for index, row in df.iterrows():
    if isinstance(row['REFERENCE_CODE'], str):
        continue  # great! a label row, nothing to do
    elif isinstance(row['REFERENCE_CODE'], int):
        pass  # pseudocode: go back up, find the last string, concatenate
    else:
        pass

I am having trouble filling out the areas where there is pseudocode. Am I correct in my logic? Is there any easier way to go about doing this? I would ideally like to hold the integrity of the original data in terms of blank spaces, size, etc. but if not, that is ok too. I will find a workaround! Thanks.


As per Andy Hayden:

  Question number REFERENCE_CODE  ... Unnamed: 12 Unnamed: 13
0             Q1a     ladder_now  ...         NaN         NaN
1             NaN            NaN  ...         NaN         NaN
2             NaN              1  ...         NaN         NaN
3             NaN              2  ...         NaN         NaN
4             NaN              3  ...         NaN         NaN

Traceback (most recent call last):
  File "/Users/xxx/Projects/trend_env/src/script4.py", line 10, in <module>
    headers = (df.REFERENCE_CODE != '') & ~df.REFERENCE_CODE.str.isnumeric()
  File "/Users/xxx/Projects/trend_env/lib/python3.7/site-packages/pandas/core/generic.py", line 1466, in __invert__
    arr = operator.inv(com.values_from_object(self))
TypeError: bad operand type for unary ~: 'float'

  Question number REFERENCE_CODE  ... Unnamed: 12 Unnamed: 13
0             Q1a     ladder_now  ...         NaN         NaN
1             NaN            NaN  ...         NaN         NaN
2             NaN              1  ...         NaN         NaN
3             NaN              2  ...         NaN         NaN
4             NaN              3  ...         NaN         NaN

[5 rows x 14 columns]

Traceback (most recent call last):
  File "/Users/mitchell_bregman/Projects/trend_env/src/script4.py", line 14, in <module>
    headers = (df.REFERENCE_CODE != '') & ~df.REFERENCE_CODE.str.isnumeric()
  File "/Users/mitchell_bregman/Projects/trend_env/lib/python3.7/site-packages/pandas/core/generic.py", line 1466, in __invert__
    arr = operator.inv(com.values_from_object(self))
TypeError: bad operand type for unary ~: 'float'
  • Can you give the output of df.to_dict() for these DataFrames? It's hard to infer what they actually are. Commented Feb 8, 2019 at 0:59
  • Also, does this start as a csv? That might be easier to convert than a DataFrame. Commented Feb 8, 2019 at 1:00
  • I can't because I created some dummy data. Commented Feb 8, 2019 at 1:00
  • Please create some dummy data, e.g. a ten line csv we can read into a DataFrame :) Commented Feb 8, 2019 at 1:02
  • One second! I will do that and get back to you. Commented Feb 8, 2019 at 1:03

2 Answers

To get the groups you can use a mask and cumsum:

In [11]: headers = (df.REFERENCE_CODE != '') & ~df.REFERENCE_CODE.str.isnumeric()

In [12]: headers.cumsum()
Out[12]:
0     1
1     1
2     1
3     1
4     1
5     2
6     2
7     2
8     2
9     2
10    2
11    2
12    3
13    3
14    3
15    3
16    3
17    3
18    4
19    4
20    4
21    4
Name: REFERENCE_CODE, dtype: int64

Now you can use this to groupby:

In [13]: res = df.groupby(headers.cumsum())['REFERENCE_CODE'].apply(lambda x: x.iloc[0] + '_' + x)

In [14]: res
Out[14]:
0       dog_dog
1         dog_1
2         dog_2
3         dog_3
4         dog_4
5       cat_cat
6         cat_1
7         cat_2
8          cat_
9         cat_4
10        cat_5
11         cat_
12      rat_rat
13         rat_
14        rat_3
15        rat_4
16        rat_5
17         rat_
18    fish_fish
19       fish_4
20       fish_5
21       fish_6
Name: REFERENCE_CODE, dtype: object

and only update the relevant (numeric) rows:

In [15]: df.REFERENCE_CODE.update(res[df.REFERENCE_CODE.str.isnumeric()])

In [16]: df
Out[16]:
   REFERENCE_CODE
0             dog
1           dog_1
2           dog_2
3           dog_3
4           dog_4
5             cat
6           cat_1
7           cat_2
8
9           cat_4
10          cat_5
11
12            rat
13
14          rat_3
15          rat_4
16          rat_5
17
18           fish
19         fish_4
20         fish_5
21         fish_6

It might be easier to do this conversion while reading the data in. I would argue that this is a strange objective (and it would be a little easier in plain Python).
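
For data closer to what the question describes (strings, real integers and missing values in one column), the same idea can be sketched without `groupby`. This is a variation, not the code above: it swaps the `cumsum` grouping for `where` plus `ffill`, and it assumes the numbers are integers rather than floats such as `1.0`. The sample frame is made up to mirror the question. Blank rows come back as empty strings rather than NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data mirroring the question: strings, ints and missing values.
df = pd.DataFrame({"REFERENCE_CODE": pd.Series(
    ["dog", 1, 2, 3, 4, "cat", 1, 2, np.nan, 4, 5, np.nan,
     "rat", np.nan, 3, 4, 5, np.nan, "fish", 4, 5, 6], dtype=object)})

# Normalise everything to text so the .str methods return plain booleans.
codes = df["REFERENCE_CODE"].fillna("").astype(str)

is_number = codes.str.isnumeric()
is_header = (codes != "") & ~is_number

# Carry each header label down to the rows below it.
prefix = codes.where(is_header).ffill()

# Prefix only the numeric rows; headers and blanks stay untouched.
df["REFERENCE_CODE"] = codes.where(~is_number, prefix + "_" + codes)

print(df["REFERENCE_CODE"].tolist())
```

Assigning the result back to the column avoids relying on `update` modifying the frame in place.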


6 Comments

Hey, I followed yours-- edits above... is this a 3.6 v 3.7 thing?
@sgerbhctim I am also on 3.7, that's weird. What is the output of df.REFERENCE_CODE.str.isnumeric().dtype ?
@sgerbhctim you might want to first do df.REFERENCE_CODE = df.REFERENCE_CODE.fillna('')
Edited again above.. hahah, super sorry about this - such a weird task
@sgerbhctim even after the fillna? It should be bool.
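
The dtype question raised in these comments can be checked in isolation. The sketch below uses a made-up object column with one missing value; it only illustrates why the mask may not be boolean before `fillna('')` and what it looks like afterwards. It is not taken from the asker's file.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a missing value, held as generic Python objects.
s = pd.Series(["ladder_now", np.nan, "1", "2"], dtype=object)

# A missing entry yields a missing result here, so the mask is not
# guaranteed to be boolean and applying `~` to it may fail.
raw_mask = s.str.isnumeric()
print(raw_mask.dtype)

# Filling the gaps with empty strings first leaves only text behind.
filled = s.fillna("")
clean_mask = filled.str.isnumeric()
print(clean_mask.dtype)

headers = (filled != "") & ~clean_mask
print(headers.tolist())
```

If the column also holds real integers rather than digit strings, converting it to text first (for example with `astype(str)`) is a reasonable extra precaution.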

What you could do is apply a function along that series, using a mutable default argument of the function as a "cache". I'll assume that what you have is the following list of values:

ls = ['dog', 1, 2, 3, 4, 'cat', 1, 2, '', 4, 5,
      'rat', '', 3, 4, 5, '', 'fish', 4, 5, 6]


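def append_string(x, last_string_value=['initial_string']):
    # The mutable default list persists between calls and acts as the cache.
    if isinstance(x, str) or x is None:
        if x:
            # Non-empty string: remember it as the current prefix.
            last_string_value[0] = x
        return x
    else:
        # Number: prepend the most recent string seen.
        return last_string_value[0] + '_{}'.format(x)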


print(list(map(append_string, ls)))

This gives the result you need. If what you have is a dataframe, you can apply this function to the corresponding series and get the same effect.
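
As a rough sketch of the dataframe case, the same logic can be wrapped in a closure and passed to `Series.apply`. The helper name `make_prefixer` and the sample series are made up for illustration. Keeping the state in a closure instead of a default argument means each call to `make_prefixer()` starts fresh. It assumes blanks are stored as `''` or `None` (a NaN would be treated as a number) and that `apply` visits the rows in order.

```python
import pandas as pd


def make_prefixer():
    """Return a function that remembers the last non-empty string it saw."""
    last = {"value": None}  # state kept per prefixer, not shared globally

    def prefix(x):
        if isinstance(x, str) or x is None:
            if x:
                last["value"] = x
            return x
        # Anything else is treated as a number and gets the prefix.
        return "{}_{}".format(last["value"], x)

    return prefix


# Hypothetical column in the same shape as the list above.
s = pd.Series(["dog", 1, 2, "cat", 1, "", 4], dtype=object)
result = s.apply(make_prefixer())
print(result.tolist())
```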
