Python program to extract text from a text file?

Question

I have a text file which I obtained from converting a .srt file. The content is as follows:

1
0:0:1,65 --> 0:0:7,85
Hello, my name is Gareth, and in this
video, I'm going to talk about list comprehensions


2
0:0:7,85 --> 0:0:9,749
in Python.

I want only the words in the text file, written to a new text file op.txt with one word per line:

Hello
my
name 
is
Gareth
and

and so on.

This is the program I'm working on:

import os, re
f= open("D:\captionsfile.txt",'r')
k=f.read()
g=str(k)
f.close()
w=re.search('[a-z][A-Z]\s',g)
fil=open('D:\op.txt','w+')
fil.append(w)
fil.close()

But the output I get for this program is:

None
None
None

Your regex is wrong. I think you need a bit more practice at it.
– Nafiul Islam, Commented May 31, 2014 at 9:35
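
To make the comment concrete, here is a minimal sketch of why the program in the question finds nothing. The pattern '[a-z][A-Z]\s' asks for a lowercase letter, then an uppercase letter, then whitespace, and that sequence does not occur in the caption text. Separately, file objects have no append method; write is the call to use. The sample string is one caption line from the question, and the variable names are only illustrative.

```python
import re

sample = "Hello, my name is Gareth, and in this\n"

# The question's pattern: one lowercase letter, one uppercase letter, one
# whitespace character. Nothing in the caption text has that shape.
question_match = re.search(r'[a-z][A-Z]\s', sample)
print(question_match)

# re.search returns at most one match object; re.findall returns every
# non-overlapping match as a list of strings.
words = re.findall(r'[a-zA-Z]+', sample)
print(words)
```

Note that re.search would only ever give the first match, so even a correct pattern would need re.findall (or re.finditer) to collect every word.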

timgeb · Accepted Answer · 2014-05-31 09:58:46Z

3

Assuming that "m" counts as a word (short for "am") and that in.txt is your text file, you can use:

import re

with open('in.txt') as intxt:
    data = intxt.read()

x = re.findall('[aA-zZ]+', data)
print(x)

which will produce

['Hello', 'my', 'name', 'is', 'Gareth', 'and', 'in', 'this', 'video', 'I', 'm', 'going', 'to', 'talk', 'about', 'list', 'comprehensions', 'in', 'Python']

You can now write x to a new file with:

with open('out.txt', 'w') as outtxt:
    outtxt.write('\n'.join(x))

To get

I'm

instead of

I
m

you can use re.findall('[aA-zZ\']+', data)
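
Putting the steps of this answer together, a rough end-to-end sketch might look like the following. It is not the answer's exact code: it swaps in the '[a-zA-Z]' character class suggested in the comments on this answer, allows apostrophes only between letters, and writes to a temporary directory so it can run without in.txt. The names SAMPLE, WORD, and write_words are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Caption text copied from the question, inlined so the sketch is runnable.
SAMPLE = """1
0:0:1,65 --> 0:0:7,85
Hello, my name is Gareth, and in this
video, I'm going to talk about list comprehensions


2
0:0:7,85 --> 0:0:9,749
in Python.
"""

# Runs of ASCII letters, optionally joined by single apostrophes,
# so "I'm" stays one token while digits and timestamps are ignored.
WORD = re.compile(r"[a-zA-Z]+(?:'[a-zA-Z]+)*")


def write_words(text, out_path):
    """Write each word of `text` on its own line and return the word list."""
    words = WORD.findall(text)
    Path(out_path).write_text("\n".join(words) + "\n", encoding="utf-8")
    return words


out_file = Path(tempfile.mkdtemp()) / "out.txt"
words = write_words(SAMPLE, out_file)
print(words)
```

With a real file, the text would come from reading in.txt as shown earlier in this answer, and out_path would be the output file of your choice.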

edited May 31, 2014 at 9:58

answered May 31, 2014 at 9:38

timgeb


4 Comments

Padraic Cunningham Over a year ago

when did m become a word?

timgeb Over a year ago

@Padraic Cunningham I am not sure whether "I'm" is a single word or should be treated as two words, namely "I" and "am", where the "m" in "I'm" is short for "am".

miindlek Over a year ago

Doesn't your regex have to be '[a-zA-Z]+'?

miindlek Over a year ago

OK, it definitely has to be '[a-zA-Z]+'. If you use '[aA-zZ\']+', you match every ASCII character from A to z, which means that strings like 'hell[]o' would also be matched, because '[' and ']' lie between 'A' and 'z' in the ASCII table.
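
The point about the ASCII range can be checked directly. This is a small sketch using made-up test strings ('hell[]o' comes from the comment above, 'foo_bar' is an added illustration):

```python
import re

text = "hell[]o foo_bar"

# Inside a character class, 'A-z' is a single range covering ASCII codes
# 65 to 122, so it also includes the six characters between 'Z' and 'a'.
extras = [chr(code) for code in range(ord('Z') + 1, ord('a'))]
print(extras)

loose = re.findall('[aA-zZ]+', text)    # range A..z plus literal 'a' and 'Z'
strict = re.findall('[a-zA-Z]+', text)  # letters only

print(loose)
print(strict)
```

The loose class keeps brackets and underscores inside "words", while the strict class splits on them.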

Padraic Cunningham · Accepted Answer · 2014-05-31 11:29:53Z

1

import re

with open("out.txt", "a") as f1:
    with open("b.txt") as f:
        for line in f:
            # skip cue numbers and timestamp lines, which start with a digit
            if not line[0].isdigit():
                for word in line.split():
                    # strip any punctuation you don't want
                    f1.write(re.sub(r'[,.!]', "", word))
                    f1.write("\n")
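
One caveat with the check on line[0]: a caption line that itself starts with a digit would be skipped too. A possible variation, sketched here under the assumption that cue numbers are digit-only lines and timestamp lines contain "-->", filters on those instead. The function name caption_words and the last sample line ("3 words: in Python.") are hypothetical and not from the question.

```python
import string


def caption_words(lines):
    """Yield words from caption lines, skipping cue numbers and timestamps."""
    for line in lines:
        stripped = line.strip()
        # Skip blank lines, digit-only cue numbers, and timestamp lines.
        if not stripped or stripped.isdigit() or "-->" in stripped:
            continue
        for word in stripped.split():
            # Remove leading and trailing punctuation, keep inner apostrophes.
            cleaned = word.strip(string.punctuation)
            if cleaned:
                yield cleaned


sample = [
    "1\n",
    "0:0:1,65 --> 0:0:7,85\n",
    "Hello, my name is Gareth, and in this\n",
    "\n",
    "2\n",
    "0:0:7,85 --> 0:0:9,749\n",
    "3 words: in Python.\n",  # hypothetical caption starting with a digit
]
words = list(caption_words(sample))
print(words)
```

This version keeps numeric tokens such as "3" that appear inside captions; if only alphabetic words are wanted, an extra cleaned.isalpha() check would drop them, at the cost of also dropping tokens like "I'm".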

edited May 31, 2014 at 11:29

answered May 31, 2014 at 9:50

Padraic Cunningham
