How to fetch a substring from text file in python?

Question

I have a bunch of tweets in plaintext form, shown below. I am looking to extract only the text part.

SAMPLE DATA IN FILE -

Fri Nov 13 20:27:16 +0000 2015 4181010297 rt     we're treating one of you lads to this d'struct denim shirt! simply follow &amp; rt to enter
Fri Nov 13 20:27:16 +0000 2015 2891325562 this album is wonderful, i'm so proud of you, i loved this album, it really is the best.    -273
Fri Nov 13 20:27:19 +0000 2015 2347993701 international break is garbage smh. it's boring and your players get injured
Fri Nov 13 20:27:20 +0000 2015 3168571911 get weather updates from the weather channel. 15:27:19
Fri Nov 13 20:27:20 +0000 2015 2495101558 woah what happened to twitter this update is horrible
Fri Nov 13 20:27:19 +0000 2015 229544082 i've completed the daily quest in paradise island 2!
Fri Nov 13 20:27:17 +0000 2015 309233999 new post: henderson memorial public library
Fri Nov 13 20:27:21 +0000 2015 291806707 who's going to  next week?
Fri Nov 13 20:27:19 +0000 2015 3031745900 why so blue?    @ golden bee

This is my attempt at the preprocess stage -

for filename in glob.glob('*.txt'):
    with open("plain text - preprocesshurricane.txt",'a') as outfile ,open(filename, 'r') as infile:
        for tweet in infile.readlines():
            temp=tweet.split(' ')
            text=""
            for i in temp:
                x=str(i)
                if x.isalpha() :
                    text += x + ' '
            print(text)

OUTPUT-

Fri Nov rt treating one of you lads to this denim simply follow rt to 
Fri Nov this album is so proud of i loved this it really is the 
Fri Nov international break is garbage boring and your players get 
Fri Nov get weather updates from the weather 
Fri Nov woah what happened to twitter this update is 
Fri Nov completed the daily quest in paradise island 
Fri Nov new henderson memorial public 
Fri Nov going to next 
Fri Nov why so golden

This output is not the desired output because

1. It will not let me fetch numbers/digits within the text part of the tweet.
2. Every line starts with FRI NOV.

Could you please suggest a better method to achieve the same? I am not too familiar with regex, but I assume we could employ re.search(r'2015(magic to remove tweetID)/w*',tweet)

alecxe · Accepted Answer · 2016-04-25 20:27:09Z

7

You can avoid regular expressions in this case. The lines of the text you've presented are consistent in terms of how many spaces go before the tweet text. Just split():

>>> data = """
   lines with tweets here
"""
>>> for line in data.splitlines():
...     print(line.split(" ", 7)[-1])
... 
rt     we're treating one of you lads to this d'struct denim shirt! simply follow &amp; rt to enter
this album is wonderful, i'm so proud of you, i loved this album, it really is the best.    -273
international break is garbage smh. it's boring and your players get injured
get weather updates from the weather channel. 15:27:19
woah what happened to twitter this update is horrible
i've completed the daily quest in paradise island 2!
new post: henderson memorial public library
who's going to  next week?
why so blue?    @ golden bee
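
As a rough sketch of how the same idea could be wrapped up for use on lines read from a file, the helper below drops the seven leading fields and returns the rest. The name extract_tweet_text, the n_fields parameter, and the two sample lines are illustrative and not part of the original answer. It assumes every line has exactly seven single-space-separated fields (weekday, month, day, time, offset, year, tweet ID) before the text.

```python
def extract_tweet_text(line, n_fields=7):
    """Return the text after the first n_fields space-separated fields."""
    parts = line.rstrip("\n").split(" ", n_fields)
    # If the line has fewer fields than expected, there is no tweet text.
    return parts[-1] if len(parts) > n_fields else ""


# Two lines copied from the sample data in the question.
sample = [
    "Fri Nov 13 20:27:20 +0000 2015 3168571911 get weather updates from the weather channel. 15:27:19\n",
    "Fri Nov 13 20:27:19 +0000 2015 3031745900 why so blue?    @ golden bee\n",
]
texts = [extract_tweet_text(line) for line in sample]
for text in texts:
    print(text)
```

Because the split stops after seven separators, digits and repeated spaces inside the tweet text are left untouched.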

edited Apr 25, 2016 at 20:27

answered Apr 25, 2016 at 20:24

alecxe


4 Comments

Ic3fr0g Over a year ago

What is that [-1] doing? Setting the index to just before the space?

alecxe Over a year ago

@MayurH line.split(" ", 7) splits a line by the first 7 spaces. It produces a list in which the tweet text is the last item - we get it by the last index.

quapka Over a year ago

@MayurH The index -1 in <any-list>[-1] points to the last position in <any-list> (gives IndexError on an empty list). You can do fancy stuff like <some-list>[-3:] to get a list of the last three elements, etc.

Ic3fr0g Over a year ago

I didn't know that line.split() could take more than one argument. Thanks! I forgot that it returns a list. It makes sense now.

danidee · Accepted Answer · 2016-04-25 20:50:24Z

2

You can do it without a regular expression:

import glob

for filename in glob.glob('file.txt'):
    with open(filename, 'r') as infile:
        for tweet in infile:
            # The first 7 space-separated fields are the date and the tweet ID.
            temp = tweet.rstrip('\n').split(' ')
            # Re-join everything after those fields to get the tweet text back.
            print(' '.join(temp[7:]))

edited Apr 25, 2016 at 20:50

answered Apr 25, 2016 at 20:36

danidee


1 Comment

Ic3fr0g Over a year ago

This again is undesired output I believe. This includes FRI NOV? But I realise now that I simply had to break the split and join after the 7th space. Thanks for your answer.

cromod · Accepted Answer · 2016-04-25 21:44:35Z

I propose a little more specific pattern than @Rushy Panchal to avoid issues when tweets include digits: .+ \+(\d+ ){3}

Use the re.sub function:

>>> import re
>>> with open('your_file.txt','r') as file:
...     data = file.read()
...     print(re.sub(r'.+ \+(\d+ ){3}', '', data))

Output

rt     we're treating one of you lads to this d'struct denim shirt! simply follow &amp; rt to enter
this album is wonderful, i'm so proud of you, i loved this album, it really is the best.    -273
international break is garbage smh. it's boring and your players get injured
get weather updates from the weather channel. 15:27:19
woah what happened to twitter this update is horrible
i've completed the daily quest in paradise island 2!
new post: henderson memorial public library
who's going to  next week?
why so blue?    @ golden bee
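
A minimal sketch of the same substitution in Python 3, run against an in-memory string so it can be tried without a file. The names HEADER, data, and cleaned are assumptions for illustration, and the two lines are copied from the question's sample; the pattern is the one from this answer, written as a raw string.

```python
import re

# Pattern from the answer: everything up to and including "+0000 2015 <tweet id> ".
HEADER = re.compile(r".+ \+(\d+ ){3}")

data = (
    "Fri Nov 13 20:27:20 +0000 2015 3168571911 get weather updates from the weather channel. 15:27:19\n"
    "Fri Nov 13 20:27:19 +0000 2015 229544082 i've completed the daily quest in paradise island 2!\n"
)

# "." does not match newlines by default, so each line is handled separately.
cleaned = HEADER.sub("", data)
print(cleaned, end="")
```

The literal "+" in front of the UTC offset anchors the match to the header, so the "15:27:19" at the end of the first tweet is kept.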

Rushy Panchal · Accepted Answer · 2016-04-25 20:34:02Z

0

The pattern you are looking for is .+ \d+:

import re
p = re.compile(r".+ \d+")
tweets = p.sub('', data) # data is the original string

Breakdown of the Pattern

. matches any character, and + matches 1 or more. So, .+ matches one or more characters. However, if we left it at just this, we would remove all of the text.

So, we want to end the pattern with \d+: \d matches any digit, so \d+ matches any continuous sequence of digits, the last of which is the tweet ID.
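
To see how the greedy .+ behaves in practice, here is a small sketch that runs the pattern on two lines copied from the question's sample data. The variable names are illustrative. Because .+ consumes as much as it can, the match ends at the last run of digits that follows a space anywhere on the line, which is the tweet ID only when the tweet text itself contains no such run.

```python
import re

pattern = re.compile(r".+ \d+")

# No digits in the tweet text: the last " <digits>" on the line is the tweet ID.
no_digits = "Fri Nov 13 20:27:20 +0000 2015 2495101558 woah what happened to twitter this update is horrible"
# Digits in the tweet text: the last " <digits>" is the "15" in "15:27:19".
with_digits = "Fri Nov 13 20:27:20 +0000 2015 3168571911 get weather updates from the weather channel. 15:27:19"

result_plain = pattern.sub("", no_digits)
result_digits = pattern.sub("", with_digits)
print(repr(result_plain))   # ' woah what happened to twitter this update is horrible'
print(repr(result_digits))  # ':27:19'
```

The first result also keeps the space that followed the tweet ID, so a .lstrip() may be wanted even in the case where the pattern works.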

answered Apr 25, 2016 at 20:34

Rushy Panchal


2 Comments

Ic3fr0g Over a year ago

Will check this and revert to you.

cromod Over a year ago

Your pattern doesn't work for this line: Fri Nov 13 20:27:20 +0000 2015 3168571911 get weather updates from the weather channel. 15:27:19. You display :27:19.
