Python adding rows to DataFrame based on string indexing

Question

I'm using a text file that is updated every day and I want to extract the values from the string and append them to a DataFrame. The text file doesn't change structurally (at least mostly), just the values are updated, so I've written some code to extract the values preceding the keywords in my list.

To make my life easier I've tried to build a for-loop to automate as much as possible, but frustratingly I'm stuck at appending the values I've sourced to my DataFrame. All the tutorials I've looked at are dealing with ranges in for loops.

empty_df = pd.DataFrame(columns = ["date","builders","miners","roofers"])

text = "On 10 May 2022, there were 400 builders living in Rome, there were also no miners and approximately 70 roofers"
text = text.split()
profession = ["builders","miners","roofers"]

for i in text:
    if i in profession:
       print(text[text.index(i) - 1] + " " + i)

400 builders
no miners
70 roofers

I've tried to append using:

for i in text:
    if i in profession:
       empty_df.append(text[text.index(i) - 1] + " " + i)

But it doesn't work, and I'm really unsure how to append multiple calculated variables.

So what I want to know is:

How can I append the resulting values to my empty dataframe in the correct columns.
How could I convert the 'no' or 'none' into zero.
How can I also incorporate the date each time I update this?

Thanks

vtasca · Accepted Answer · 2022-05-10 09:53:43Z

1

If you just want a plug and play solution, this will get you where you need to go:

from dateutil import parser
import numpy as np

empty_df = pd.DataFrame(columns = ["builders","miners","roofers","date"])
text = "On 10 May 2022, there were 400 builders living in Rome, there were also no miners and approximately 70 roofers"
date = parser.parse(text.split(',')[0]).strftime('%d %B %Y')

foo = text.split()
profession = ["builders","miners","roofers"]

total_no = []
for i in foo:
    if i in profession:
       total_no.append(foo[foo.index(i) - 1])

empty_df.loc[len(empty_df)] = total_no + [date]

empty_df.replace('no', np.nan)

Outputting:

    builders    miners  roofers date
0   400         NaN     70      10 May 2022

answered May 10, 2022 at 9:53

vtasca

1,78014 silver badges18 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Pryore Over a year ago

Thanks, that worked really well. I just have one small additional question. I tried to replicate this approach in some text but geographically, so, e.g. builder numbers in Rome, Venice and Brussels, but I found the first value just gets replicated, so it seems the appending function stops after the first value. Is there a way to bypass this?

DataSciRookie · Accepted Answer · 2022-05-10 09:43:52Z

1)How can I append the resulting values to my empty dataframe in the correct columns.

I think you need to do a preprocess before, you should iterate in the sentence when you detect a keyword (builders) you take the words before and after (with spliting by ' '). Now the word before and after you try to transform it into a float if it works you stock the result in list of list : ['builders',400] and you have searched for everything you able to add the rows with all the informations

2) How could I convert the 'no' or 'none' into zero.

With my method you don't need if you were enable to transform the words before or after in a float, then it should be 0

3) How can I also incorporate the date each time I update this?

https://theautomatic.net/2018/12/18/2-packages-for-extracting-dates-from-a-string-of-text-in-python/

Collectives™ on Stack Overflow

Python adding rows to DataFrame based on string indexing

2 Answers 2

1 Comment

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Related