In my Scrapy project I want to extract data from a website. It turned out that all the information is stored in a script tag, which I can easily read as JSON and then extract the data I need.

That's my function:

    def parse(self, response):
        items = response.css("script:contains('window.__INITIAL_STATE__')::text").re_first(r"window\.__INITIAL_STATE__ =(.*);")
        for item in json.loads(items)['offers']:
            yield {
                "title": item['jobTitle'],
                "employer": item['employer'],
                "country": item['countryName'],
                "details_page": item['companyProfileUrl'],
                "expiration_date": item['expirationDate'],
                'salary': item['salary'],
                'employmentLevel': item['employmentLevel'],
            }

The JSON has this structure:

    var = {
        "offers": [
            {
                "commonOfferId": "1200072247",
                "jobTitle": "Automatyk - Programista",
                "employer": "MULTIPAK Spółka Akcyjna",
                "companyProfileUrl": "https://pracodawcy.pracuj.pl/company/20379037/profile",
                "expirationDate": "2021-04-28T12:47:06.273",
                "salary": "",
                "employmentLevel": "Specjalista (Mid / Regular)",
                "offers": [
                    {
                        "offerId": 500092126,
                        "regionName": "kujawsko-pomorskie",
                        "cities": ["Małe Czyste (pow. chełmiński)"],
                        "label": "Małe Czyste (pow. chełmiński)"
                    }
                ],

Above is an example of one element. When I try to extract data like cities or regionName, I get an error. How can I write a for loop that goes through both levels of dictionaries and yields that data in a new dictionary?

  • So, each offer has multiple "offers". What do you want your output to be? Do you want one entry per inner offer, so you potentially get multiple entries per outer offer? Commented Apr 13, 2021 at 22:11

1 Answer

You didn't make it clear what you want, but I'm guessing this is close:

    def parse(self, response):
        items = response.css("script:contains('window.__INITIAL_STATE__')::text").re_first(r"window\.__INITIAL_STATE__ =(.*);")
        for item in json.loads(items)['offers']:
            for offer in item['offers']:
                yield {
                    "title": item['jobTitle'],
                    "employer": item['employer'],
                    "country": item['countryName'],
                    "details_page": item['companyProfileUrl'],
                    "expiration_date": item['expirationDate'],
                    'salary': item['salary'],
                    'employmentLevel': item['employmentLevel'],
                    'offernumber': offer['offerId'],
                    'region': offer['regionName'],
                    'city': offer['cities'][0]
                }
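The nested loop can be tried outside Scrapy on a trimmed copy of the sample data. This is only a sketch under a few assumptions: `flatten_offers` is a hypothetical helper name, the sample string is cut down from the structure in the question, and `.get()` is used so that a key missing from a record (such as `countryName`, which does not appear in the sample) gives `None` instead of raising `KeyError`.

```python
import json

# Trimmed, hypothetical sample that mirrors the structure shown in the question.
raw = """
{"offers": [
    {"jobTitle": "Automatyk - Programista",
     "employer": "MULTIPAK Spółka Akcyjna",
     "offers": [
         {"offerId": 500092126,
          "regionName": "kujawsko-pomorskie",
          "cities": ["Małe Czyste (pow. chełmiński)"]}
     ]}
]}
"""


def flatten_offers(data):
    """Yield one flat dict per inner offer, copying the outer fields into it."""
    for item in data.get("offers", []):
        for offer in item.get("offers", []):
            cities = offer.get("cities") or []
            yield {
                "title": item.get("jobTitle"),
                "employer": item.get("employer"),
                "country": item.get("countryName"),  # not in the sample, so None
                "offernumber": offer.get("offerId"),
                "region": offer.get("regionName"),
                "city": cities[0] if cities else None,  # guard against an empty list
            }


rows = list(flatten_offers(json.loads(raw)))
```

Inside the spider, the body of `parse` could iterate over `flatten_offers(json.loads(items))` and yield each row, assuming the extracted script text is valid JSON.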

1 Comment

That is exactly what I needed. Thank you.
