I am using RegEx to extract some data from a txt file. I wrote the for-loops below to extract emails and birthdates and tried to append the matching lines to a list. But when I print the list, only the output of the first loop is there. The birthdate RegEx works fine when run by itself. I'm sure I'm doing something very basic wrong.

f = open("/Users/me/Desktop/scrape.txt", "r", encoding="utf8")

list = []

for i in f:
    if re.findall(r"((?i)[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.])", i):
        list.append(i)

for k in f:
    if re.findall(r'\d\d-\d\d-\d\d\d\d', k):
        list.append(k)

print(list)
f.close()
  • Not an answer, but note that you are using the case-insensitive modifier (?i) in your first pattern, so you could drop A-Z. Also, in your second regex, \d\d\d\d is better written \d{4}. Commented Apr 10, 2020 at 14:17
  • Does this answer your question? Read multiple times lines of the same file Python Commented Apr 10, 2020 at 14:17
  • Your iterator f has already reached the end of the file (EOF) by the time you enter the second loop. So you either need to call f.seek(0) before the second loop, or combine the two regexes with |; I think combining them should work just fine. Commented Apr 10, 2020 at 14:18
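
The exhaustion described in the last comment can be reproduced without the original file. The following is a minimal sketch, assuming an in-memory io.StringIO as a stand-in for the opened file (the two sample lines are made up); a text file object returned by open() is iterated and rewound the same way.

```python
import io

# io.StringIO stands in for the opened text file (sample content is made up).
f = io.StringIO("alice@example.com\n01-02-1990\n")

first_pass = [line for line in f]   # consumes the iterator up to EOF
second_pass = [line for line in f]  # nothing left to read
f.seek(0)                           # rewind to the start of the stream
third_pass = [line for line in f]   # all lines are available again

print(first_pass)
print(second_pass)
print(third_pass)
```

The second pass comes back empty for the same reason the second for-loop in the question never runs its body.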

2 Answers

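You are trying to iterate over the same file object twice. The first loop consumes it up to the end of the file, so the second for-loop has nothing left to read and its body never runs. This snippet shows the effect: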

f = open("/Users/me/Desktop/scrape.txt", "r", encoding="utf8")
print(list(f))
print("second time:")
print(list(f))

Output:

['1234567890abcdefghijklmopqrstuvwxyz'] # or whatever your content is :)
second time:
[]

To fix this, you can store the lines of the file in a list (if you are not dealing with huge files, of course):

f = open("/Users/me/Desktop/scrape.txt", "r", encoding="utf8")
content = list(f)


for i in content:
   ... 

for k in content:
   ... 

In your specific example it would be cleaner (and faster) to do all the processing in a single for-loop. Either way, the mistake was trying to read from the same file twice without resetting it.
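
A single-pass version could look roughly like the sketch below. The names collect_matching_lines, EMAIL_RE and DATE_RE are hypothetical, the email pattern is a loose variant of the one in the question, and the sample lines are made up. Unlike the two-loop code, it returns matches in file order and adds a line only once even if it contains both an email and a date.

```python
import re

# Hypothetical patterns; the email one is deliberately loose, as in the question.
EMAIL_RE = re.compile(r"[a-z0-9_.+-]+@[a-z0-9-]+\.[a-z0-9.-]+", re.IGNORECASE)
DATE_RE = re.compile(r"\d{2}-\d{2}-\d{4}")


def collect_matching_lines(lines):
    """Return the lines that contain an email address or a dd-dd-dddd date."""
    matches = []
    for line in lines:
        # Both checks run on the same line, so the input is read only once.
        if EMAIL_RE.search(line) or DATE_RE.search(line):
            matches.append(line.rstrip("\n"))
    return matches


sample = ["Name: Bob\n", "bob@example.com\n", "born 10-04-1985\n"]
print(collect_matching_lines(sample))
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, which also avoids holding the whole file in memory.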

4 Comments

A note of caution: if the file is large, storing it as a list can make the list HUGE.
True. I just hoped the list of emails and birthdays is not in the order of millions.
@abhinonymous : added a note about this.
Imagine doing that over a wiki dump, I'm sure someone has done that at some point of time :)

Try this:

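import re

matches = []
with open("/Users/me/Desktop/scrape.txt", "r", encoding="utf8") as f:
    for line in f:  # check every line, not just the first one
        # (?i) must lead the pattern; Python 3.11+ rejects it mid-pattern
        if re.findall(r"(?i)[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]", line):
            matches.append(line)
        if re.findall(r'\d\d-\d\d-\d\d\d\d', line):
            matches.append(line)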

In your code, after the first for loop, f is positioned at the end of the file, so the second for loop doesn't "run" the way you intend it to.

So, to make your code do what you intended with minimal changes, close the file after the first loop and reopen it before the second loop, so that f starts at the beginning of the file again:

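import re

f = open("/Users/me/Desktop/scrape.txt", "r", encoding="utf8")

results = []  # renamed from "list" so the built-in is not shadowed

for i in f:
    # (?i) moved to the front; Python 3.11+ rejects it mid-pattern
    if re.findall(r"(?i)[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]", i):
        results.append(i)

f.close()

# Reopening gives a fresh file object positioned at the start of the file
f = open("/Users/me/Desktop/scrape.txt", "r", encoding="utf8")
for k in f:
    if re.findall(r'\d\d-\d\d-\d\d\d\d', k):
        results.append(k)

print(results)
f.close()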

1 Comment

Please, when answering, explain the OP's error and how your code fixes it. The main goal of SO is to help people learn, not to hand out code that just works.
