I have a text file which I obtained from converting a .srt file. The content is as follows:

1
0:0:1,65 --> 0:0:7,85
Hello, my name is Gareth, and in this
video, I'm going to talk about list comprehensions


2
0:0:7,85 --> 0:0:9,749
in Python.

I want only the words present in the text file, written to a new text file op.txt, one word per line:

Hello
my
name 
is
Gareth
and

and so on.

This is the program I'm working on:

import os, re
f= open("D:\captionsfile.txt",'r')
k=f.read()
g=str(k)
f.close()
w=re.search('[a-z][A-Z]\s',g)
fil=open('D:\op.txt','w+')
fil.append(w)
fil.close()

But the output I get for this program is:

None
None
None
Your regex is wrong. I think you need a bit more practice at it. Commented May 31, 2014 at 9:35
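
To see why the pattern in the question finds nothing: [a-z][A-Z]\s asks for exactly one lowercase letter, then one uppercase letter, then one whitespace character, and re.search returns a single match object or None rather than a list of words. A minimal sketch, assuming the first caption from the question is inlined as a sample string:

```python
import re

# Assumed sample: the first caption from the question, inlined as a string.
sample = "Hello, my name is Gareth, and in this\nvideo, I'm going to talk about list comprehensions\n"

# One lowercase letter, one uppercase letter, one whitespace character:
# that sequence never occurs in the sample, so search() returns None.
strict = re.search(r"[a-z][A-Z]\s", sample)

# A run of one or more letters matches each word; findall() returns them all.
words = re.findall(r"[a-zA-Z]+", sample)

print(strict)
print(words[:3])
```

Note also that file objects have no append method; text is written with write, as the answers do.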

2 Answers

Assuming "m" counts as a word (short for "am") and that in.txt is your text file, you can use

import re

with open('in.txt') as intxt:
    data = intxt.read()

x = re.findall('[a-zA-Z]+', data)
print(x)

which will produce

['Hello', 'my', 'name', 'is', 'Gareth', 'and', 'in', 'this', 'video', 'I', 'm', 'going', 'to', 'talk', 'about', 'list', 'comprehensions', 'in', 'Python']

You can now write x to a new file with:

with open('out.txt', 'w') as outtxt:
    outtxt.write('\n'.join(x))

To get

I'm

instead of

I
m

you can use re.findall('[a-zA-Z\']+', data)
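
Putting the steps of this answer together, a sketch might look like the following. The function name extract_words and the inlined sample are assumptions for illustration; in.txt and out.txt are the file names used in this answer.

```python
import re

def extract_words(text):
    """Return every run of ASCII letters and apostrophes found in text."""
    return re.findall(r"[a-zA-Z']+", text)

# Assumed sample mirroring the captions file in the question.
sample = (
    "1\n"
    "0:0:1,65 --> 0:0:7,85\n"
    "Hello, my name is Gareth, and in this\n"
    "video, I'm going to talk about list comprehensions\n"
)

words = extract_words(sample)
print(words)

# To process real files, read in.txt and write one word per line to out.txt:
# with open('in.txt') as intxt:
#     words = extract_words(intxt.read())
# with open('out.txt', 'w') as outtxt:
#     outtxt.write('\n'.join(words))
```

Cue numbers and timestamps contain no letters, so they drop out without special handling. Because the apostrophe is part of the character class, a stray quote mark standing on its own would also be returned as a "word".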


4 Comments

when did m become a word?
@Padraic Cunningham I am not sure whether "I'm" is one word or should be treated as two words, namely "I" and "am", where the "m" in "I'm" is short for "am".
Doesn't your regex have to be '[a-zA-Z]+'?
Ok, it definitely has to be '[a-zA-Z]+'. If you use '[aA-zZ\']+' you match all ASCII characters from A to z, which means that strings like 'hell[]o' would also be matched as one word, because '[' and ']' lie between 'A' and 'z' in the ASCII table.
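
The difference between the two character classes can be checked directly. A small sketch, using the 'hell[]o' string from the comment above:

```python
import re

text = "hell[]o"

# 'A-z' spans code points 65..122, which includes [ \ ] ^ _ ` between Z and a.
loose = re.findall(r"[aA-zZ]+", text)

# Two separate ranges cover only the 52 ASCII letters.
strict = re.findall(r"[a-zA-Z]+", text)

print(loose)   # ['hell[]o']
print(strict)  # ['hell', 'o']
```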
import re

with open("out.txt","a") as f1:
    with open("b.txt")  as f:
        for line in f:
            if not line[0].isdigit():  # skip cue numbers and timestamp lines
                for word in line.split():
                    f1.write(re.sub(r'[,.!]', "", word)) # replace any punctuation you don't want
                    f1.write("\n")
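
One caveat with the line[0].isdigit() test: it also drops any caption line that happens to start with a digit. A sketch of a stricter filter, assuming cue-number lines contain only digits and timing lines contain "-->"; the helper name caption_words and the sample lines are hypothetical:

```python
import string

def caption_words(lines):
    """Yield words from caption text lines, skipping cue numbers and timings."""
    for line in lines:
        stripped = line.strip()
        # Skip blank lines, cue numbers ("1", "2", ...) and timing lines.
        if not stripped or stripped.isdigit() or "-->" in stripped:
            continue
        for word in stripped.split():
            # Trim punctuation from both ends but keep inner apostrophes.
            cleaned = word.strip(string.punctuation)
            if cleaned:
                yield cleaned

# Hypothetical sample: a cue number, a timing line, and a caption starting with a digit.
sample = [
    "2\n",
    "0:0:7,85 --> 0:0:9,749\n",
    "3 reasons to use them, in Python.\n",
]
result = list(caption_words(sample))
print(result)
```

Numbers inside a caption are kept as tokens here, and a caption line consisting only of digits would still be skipped, so this remains a heuristic rather than a full .srt parser.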
