0

I'm using a text file that is updated every day and I want to extract the values from the string and append them to a DataFrame. The text file doesn't change structurally (at least mostly), just the values are updated, so I've written some code to extract the values preceding the keywords in my list.

To make my life easier I've tried to build a for-loop to automate as much as possible, but frustratingly I'm stuck at appending the values I've sourced to my DataFrame. All the tutorials I've looked at are dealing with ranges in for loops.

empty_df = pd.DataFrame(columns = ["date","builders","miners","roofers"])

text = "On 10 May 2022, there were 400 builders living in Rome, there were also no miners and approximately 70 roofers"
text = text.split()
profession = ["builders","miners","roofers"]

for i in text:
    if i in profession:
       print(text[text.index(i) - 1] + " " + i)

400 builders
no miners
70 roofers

I've tried to append using:

for i in text:
    if i in profession:
       empty_df.append(text[text.index(i) - 1] + " " + i)

But it doesn't work, and I'm really unsure how to append multiple calculated variables.

So what I want to know is:

  1. How can I append the resulting values to my empty dataframe in the correct columns.
  2. How could I convert the 'no' or 'none' into zero.
  3. How can I also incorporate the date each time I update this?

Thanks

2 Answers 2

1

If you just want a plug and play solution, this will get you where you need to go:

from dateutil import parser
import numpy as np

empty_df = pd.DataFrame(columns = ["builders","miners","roofers","date"])
text = "On 10 May 2022, there were 400 builders living in Rome, there were also no miners and approximately 70 roofers"
date = parser.parse(text.split(',')[0]).strftime('%d %B %Y')

foo = text.split()
profession = ["builders","miners","roofers"]

total_no = []
for i in foo:
    if i in profession:
       total_no.append(foo[foo.index(i) - 1])

empty_df.loc[len(empty_df)] = total_no + [date]

empty_df.replace('no', np.nan)

Outputting:

    builders    miners  roofers date
0   400         NaN     70      10 May 2022
Sign up to request clarification or add additional context in comments.

1 Comment

Thanks, that worked really well. I just have one small additional question. I tried to replicate this approach in some text but geographically, so, e.g. builder numbers in Rome, Venice and Brussels, but I found the first value just gets replicated, so it seems the appending function stops after the first value. Is there a way to bypass this?
0

1)How can I append the resulting values to my empty dataframe in the correct columns.

I think you need to do a preprocess before, you should iterate in the sentence when you detect a keyword (builders) you take the words before and after (with spliting by ' '). Now the word before and after you try to transform it into a float if it works you stock the result in list of list : ['builders',400] and you have searched for everything you able to add the rows with all the informations

2) How could I convert the 'no' or 'none' into zero.

With my method you don't need if you were enable to transform the words before or after in a float, then it should be 0

3) How can I also incorporate the date each time I update this?

https://theautomatic.net/2018/12/18/2-packages-for-extracting-dates-from-a-string-of-text-in-python/

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.