IndexError: list index out of range with Regular expression

Question

I am trying to scrape data from https://www.seloger.com/ and I get the error shown below. I don't understand what is wrong, because this code worked when I tried it before.

import re
import requests
import csv
import json


with open("selog.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "Type", "Prix", "Code_postal", "Ville", "Departement", "Nombre_pieces", "Nbr_chambres", "Type_cuisine", "Surface"])


for i in range(1, 500):
   url = str('https://www.seloger.com/list.htm?tri=initial&idtypebien=1,2&pxMax=3000000&div=2238&idtt=2,5&naturebien=1,2,4&LISTING-LISTpg=' + str(i))
   r = requests.get(url, headers = {'User-Agent' : 'Mozilla/5.0'})
   p = re.compile('var ava_data =(.*);\r\n\s+ava_data\.logged = logged;', re.DOTALL)
   x = p.findall(r.text)[0].strip().replace('\r\n    ','').replace('\xa0',' ').replace('\\','\\\\')
   x = re.sub(r'\s{2,}|\\r\\n', '', x)
   data = json.loads(x)
   f = csv.writer(open("Seloger.csv", "wb+"))


   for product in data['products']:
      ID = product['idannonce']
      prix = product['prix']
      surface = product['surface']
      code_postal = product['codepostal']
      nombre_pieces = product['nb_pieces']
      nbr_chambres = product['nb_chambres']
      Type = product['typedebien']
      type_cuisine = product['idtypecuisine']
      ville = product['ville']
      departement = product['departement']
      etage = product['etage']
      writer.writerow([ID, Type, prix, code_postal, ville, departement, nombre_pieces, nbr_chambres, type_cuisine, surface])

This is the error:

Traceback (most recent call last):
  File "Seloger.py", line 20, in <module>
    x = p.findall(r.text)[0].strip().replace('\r\n    ','').replace('\xa0',' ').replace('\\','\\\\')
IndexError: list index out of range

"list index out of range" means that something is wrong with index [0], so first check what print(p.findall(r.text)) shows.
– furas, Commented May 16, 2019 at 11:15
If p.findall(r.text) returns an empty list, check r.text: save it to a file and open it in a web browser. It may contain a useful message, a warning for bots/scripts, or a captcha.
– furas, Commented May 16, 2019 at 11:28
I ran the code and sometimes got a page with the text "Oops, une erreur technique est survenue. Merci de ressayer ultérieurement.", which means "Oops, a technical error has occurred. Please try again later." In that case findall() returns an empty list, so there is no index [0] and the code raises "list index out of range".
– furas, Commented May 16, 2019 at 11:57
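
The checks suggested in these comments could be sketched roughly as follows. This is only an illustration, not the asker's code: the helper name first_match_or_dump and the file name debug_page.html are assumptions introduced here, and the pattern is a raw-string version of the one in the question.

```python
import re
from pathlib import Path

# Raw-string version of the pattern used in the question.
PATTERN = re.compile(r'var ava_data =(.*);\r\n\s+ava_data\.logged = logged;', re.DOTALL)


def first_match_or_dump(html, dump_path="debug_page.html"):
    """Return the first captured group, or None after saving the page for inspection.

    The function name and the default dump_path are hypothetical.
    """
    matches = PATTERN.findall(html)
    if not matches:
        # No match: write the response body to disk so it can be opened in a browser
        # and checked for an error message, a bot warning, or a captcha.
        Path(dump_path).write_text(html, encoding="utf-8")
        return None
    return matches[0]
```

Calling first_match_or_dump(r.text) inside the loop, and skipping the page when it returns None, avoids indexing into an empty list and leaves a copy of the unexpected page to look at.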

user7313188 · Accepted Answer · 2019-05-16 11:33:50Z

1

This line is wrong:

x = p.findall(r.text)[0].strip().replace('\r\n    ','').replace('\xa0',' ').replace('\\','\\\\')

What do you need to find in the text?

To work on the whole scraped text, change the line above to:

x = r.text.strip().replace('\r\n    ','').replace('\xa0',' ').replace('\\','\\\\')

and then search it for whatever you need.


furas Over a year ago

The problem is that the page sometimes shows the message "Oops, une erreur technique est survenue. Merci de ressayer ultérieurement.", which means "Oops, a technical error has occurred. Please try again later.", and then findall() can't find the expected text.
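
Since the message asks to try again later, one possible way to handle it is to retry the request a few times before giving up. The sketch below is an assumption, not something proposed in the thread: fetch_payload, its fetch callable, and the attempts and delay parameters are hypothetical names, and the pattern is a raw-string version of the one in the question.

```python
import re
import time

# Raw-string version of the pattern used in the question.
PATTERN = re.compile(r'var ava_data =(.*);\r\n\s+ava_data\.logged = logged;', re.DOTALL)


def fetch_payload(fetch, url, attempts=3, delay=2.0):
    """Call fetch(url) up to `attempts` times; return the captured text or None.

    `fetch` is any callable that takes a URL and returns the page text.
    """
    for attempt in range(attempts):
        html = fetch(url)
        match = PATTERN.search(html)
        if match:
            return match.group(1).strip()
        # Probably the temporary error page: wait before the next attempt.
        if attempt < attempts - 1:
            time.sleep(delay)
    return None
```

With the question's setup, fetch could be lambda u: requests.get(u, headers={'User-Agent': 'Mozilla/5.0'}).text. Passing the fetcher in as an argument keeps the retry logic testable without network access.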

Wiktor Stribiżew · Accepted Answer · 2019-06-06 07:34:07Z

The error occurs because sometimes there is no match, and you are trying to access a non-existing item in an empty list. The same result can be reproduced with print(re.findall("s", "d")[0]).
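
As a small, self-contained illustration of that reproduction (the variable name matches is only for this example):

```python
import re

# "d" contains no "s", so findall returns an empty list.
matches = re.findall("s", "d")
print(matches)  # []

try:
    matches[0]  # indexing an empty list
except IndexError as exc:
    print(exc)  # list index out of range
```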

To fix the issue, replace the line x = p.findall(r.text)[0].strip().replace('\r\n    ','').replace('\xa0',' ').replace('\\','\\\\') with

x = ''
xm = p.search(r.text)
if xm:
    x = xm.group(1).strip().replace('\r\n    ','').replace('\xa0',' ').replace('\\','\\\\')

NOTES

When you use p.findall(r.text)[0], you want the first match in the input, so re.search is the better choice here because it returns only the first match.
To obtain the substring captured in the first capturing group, use matchObject.group(1).
The if xm: check is important: if there is no match, x remains an empty string; otherwise it is assigned the modified value of Group 1.
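
Putting these notes together with the cleanup and json.loads steps from the question, a guarded extraction might look like the sketch below. The function name parse_products and the choice to return an empty list when nothing matches are assumptions made for illustration; the pattern is a raw-string version of the one in the question.

```python
import json
import re

# Raw-string version of the pattern used in the question.
PATTERN = re.compile(r'var ava_data =(.*);\r\n\s+ava_data\.logged = logged;', re.DOTALL)


def parse_products(html):
    """Return the list under 'products', or [] when the page has no ava_data block."""
    xm = PATTERN.search(html)
    if not xm:
        # No match, for example the temporary error page: nothing to parse.
        return []
    # Same cleanup chain as in the question, applied to Group 1.
    x = xm.group(1).strip().replace('\r\n    ', '').replace('\xa0', ' ').replace('\\', '\\\\')
    x = re.sub(r'\s{2,}|\\r\\n', '', x)
    return json.loads(x).get('products', [])
```

In the question's loop, for product in parse_products(r.text): would then simply iterate zero times on a page without a match, instead of raising IndexError or passing an empty string to json.loads.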
