Sentence processing in Python

Question

I have a file with data laid out like this:

Sentence[0].Sentence[1].Sentence[2].'\n'
Sentence[0].Sentence[1].Sentence[2].'\n'
Sentence[0].Sentence[1].Sentence[2].'\n'

I want to print every Sentence[0], that is, the first sentence of each line. This is what I have tried, but it prints an empty list.

from nltk import *
import codecs
f=codecs.open('topon.txt','r+','cp1251')
text = f.readlines()
first=[sentence for sentence in text if re.findall('\.\n^Abc',sentence)]
print first

Is this homework, or do you have multiple accounts? This exact post just showed up, barely minutes ago.
– inspectorG4dget, Commented Oct 30, 2013 at 20:31
Since you're reading individual lines, your regexp will never succeed: it's looking for a newline in the middle of a string. And why are you searching for Abc when you say you want the first of the three sentences? Please clarify your task.
– alexis, Commented Oct 30, 2013 at 20:32
@inspectorG4dget I know about that post. It was created because I couldn't post a question from my own account. Once I succeeded, it was deleted.
– Khrystyna Pyurkovska, Commented Oct 30, 2013 at 22:13
@alexis, actually I have a large amount of text with paragraphs separated by newlines, and I need to print only the first sentence of each paragraph.
– Khrystyna Pyurkovska, Commented Oct 30, 2013 at 22:17
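
The point alexis raises about the regex can be checked with a short sketch. The sample lines below are invented for illustration; the pattern is the one from the question. Each element returned by readlines() ends at its newline, so nothing can follow the '\n' for 'Abc' to match.

```python
import re

# Invented stand-ins for what f.readlines() returns: one string per line,
# each ending with its own newline.
lines = [
    "First one. Second one. Third one.\n",
    "Abc starts here. Another sentence.\n",
]

# Pattern from the question: a period, a newline, then 'Abc' at a line start.
pattern = r'\.\n^Abc'

# re.findall returns an empty (falsy) list for every line, so nothing is kept.
matches = [line for line in lines if re.findall(pattern, line)]
print(matches)  # []
```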

inspectorG4dget · Accepted Answer · 2013-10-30 20:33:29Z

3

You don't need NLTK for this (nor are you using it). Unless I misunderstand the question, this should do the trick:

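# Iterate over the file line by line; each line holds one group of sentences.
with open('topon.txt') as infile:
    for line in infile:
        # Split at the first period only and keep the part before it.
        print(line.split('.', 1)[0])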
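
As a quick sanity check, here is a minimal sketch of the same idea on in-memory lines instead of topon.txt. The sample strings and the helper name `first_sentence` are made up for illustration, and it assumes every sentence ends with a period.

```python
def first_sentence(line):
    """Return the text before the first period in line (hypothetical helper)."""
    # maxsplit=1 stops after the first period, so the rest of the line is not split.
    return line.split('.', 1)[0]


# Made-up sample data standing in for the lines of the file.
lines = [
    "The cat sat. The dog barked. The end.\n",
    "Rain fell all day. Nobody went out.\n",
]

firsts = [first_sentence(line) for line in lines]
print(firsts)
```

Note that the period itself is dropped, and a line without any period comes back unchanged, trailing newline included.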

ilalex · Accepted Answer · 2013-10-30 20:39:12Z

1

In addition to @inspectorG4dget's answer, you can do it with a regex:

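import codecs
import re

# Read the cp1251-encoded file as a list of lines.
with codecs.open('a.txt', 'r', 'cp1251') as f:
    text = f.readlines()
# '^[^.]+' matches everything from the start of a line up to the first period.
print([re.findall('^[^.]+', sentence) for sentence in text])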
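
Note that re.findall returns a list for every line, so the result above is a list of lists. If plain strings are preferred, a variant along these lines may be easier to work with. This is only a sketch: the sample lines are invented, and `first_chunk` is a hypothetical name.

```python
import re

# Everything from the start of the line up to (not including) the first period.
FIRST_CHUNK = re.compile(r'[^.]+')


def first_chunk(line):
    """Return the text before the first period, or '' if there is none to return."""
    # match() anchors at the start of the string, so no '^' is needed.
    match = FIRST_CHUNK.match(line)
    return match.group(0) if match else ''


# Invented sample lines; the last one starts with a period.
lines = ["Alpha one. Alpha two.\n", "Beta one. Beta two.\n", ".\n"]
print([first_chunk(line) for line in lines])
```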

alexis · Accepted Answer · 2013-10-31 00:44:25Z

1

Splitting a paragraph at periods works only if every sentence ends with a period and periods are used for nothing else. In any sizable body of real text, neither holds: abbreviations, questions? exclamations! etc. will trip you up a lot. So use the tool that NLTK provides for this purpose: the function sent_tokenize(). It's not perfect, but it's a whole lot better than looking for periods. If text is your list of paragraphs, you use it like this:

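import nltk  # sent_tokenize needs NLTK's Punkt tokenizer data to be installed

first = []
for par in text:
    sentences = nltk.sent_tokenize(par)
    if sentences:  # skip blank paragraphs, which yield no sentences
        first.append(sentences[0])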

You could fold the above into a list comprehension, but it's not going to be very readable...
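
To see why splitting at periods breaks on real text, here is a small standard-library sketch (no NLTK required). The sample paragraphs are invented for illustration.

```python
# Made-up paragraphs: one with an abbreviation, one whose first sentence is a question.
paragraphs = [
    "Dr. Smith arrived late. Nobody minded.",
    "Is it raining? I left my umbrella at home.",
]

# Naive approach: treat everything before the first period as the first sentence.
naive = [par.split('.', 1)[0] for par in paragraphs]
print(naive)
```

The first result is cut off at the abbreviation ("Dr"), and the second runs past the question mark and swallows two sentences. These are the kinds of cases a trained sentence tokenizer is meant to handle better.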
