
I have a file with data like this:

Sentence[0].Sentence[1].Sentence[2].'\n'
Sentence[0].Sentence[1].Sentence[2].'\n'
Sentence[0].Sentence[1].Sentence[2].'\n'

I want to print Sentence[0] from every line. This is what I have tried, but it prints an empty list:

from nltk import *
import codecs
f=codecs.open('topon.txt','r+','cp1251')
text = f.readlines()
first=[sentence for sentence in text if re.findall('\.\n^Abc',sentence)]
print first
  • 2
    Is this homework, or do you have multiple accounts? This exact post just showed up, barely minutes ago Commented Oct 30, 2013 at 20:31
  • 1
    Since you're reading individual lines, your regexp will never succeed: It's looking for a newline in the middle of a string. And why are you searching for Abc when you say you want the first of the three sentences? Please clarify your task. Commented Oct 30, 2013 at 20:32
  • @inspectorG4dget I know about that post. It was created because I couldn't post a question from my own account. Once I succeeded, it was deleted. Commented Oct 30, 2013 at 22:13
  • @alexis, actually I have a large amount of text with paragraphs separated by newlines, and I need to print only the first sentence of each paragraph. Commented Oct 30, 2013 at 22:17
  • Fair enough. I just wanted to make sure. Commented Oct 30, 2013 at 22:22

3 Answers

You don't need NLTK for this (nor are you using it). Unless I misunderstand the question, this should do the trick:

import io

with io.open('topon.txt', encoding='cp1251') as infile:
    for line in infile:
        # Keep only the text before the first period on the line.
        print(line.split('.', 1)[0])

In addition to @inspectorG4dget's answer, you can do it with a regex:

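import codecs
import re

# Read the file as cp1251 text, one line per list element.
with codecs.open('a.txt', 'r', 'cp1251') as f:
    text = f.readlines()
# '^[^.]+' matches everything from the start of the line up to the first period.
print([re.findall(r'^[^.]+', sentence) for sentence in text])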
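Note that re.findall() returns a list for each line, so the comprehension above yields a list of lists. A minimal sketch of a flat variant follows, assuming hypothetical in-memory sample lines in place of the file; it uses re.match() and also excludes the newline, so blank lines are skipped rather than matched.

```python
import re

# Hypothetical sample lines standing in for the file contents.
lines = ["First one. Second one. Third one.\n", "\n", "Alpha. Beta.\n"]

# '[^.\n]+' matches a run of characters up to the first period or newline.
pattern = re.compile(r'[^.\n]+')

first = []
for line in lines:
    m = pattern.match(line)   # match() only tries at the start of the line
    if m:                     # a blank line gives no match and is skipped
        first.append(m.group(0))

print(first)
```

Like the split-based answer, this still treats every period as a sentence boundary.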

Splitting a paragraph at periods works only if every sentence ends with a period and periods are used for nothing else. If you have a lot of real text, neither of these is even close to true. Abbreviations, questions? exclamations! etc. will trip you up a lot. So use the tool that NLTK provides for this purpose: the function sent_tokenize(). It's not perfect, but it's a whole lot better than looking for periods. If text is your list of paragraphs, you use it like this:

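import nltk

first = []
for par in text:  # text is assumed to be your list of paragraphs
    # Split the paragraph into sentences and keep the first one.
    sentences = nltk.sent_tokenize(par)
    if sentences:  # skip blank paragraphs
        first.append(sentences[0])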

You could fold the above into a list comprehension, but it's not going to be very readable...
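For what it's worth, here is a rough sketch of what the comprehension form might look like. The helper first_sentences() and the stand-in naive_tokenize() are hypothetical names introduced only for illustration; the stand-in exists so the sketch runs without NLTK, and in real use you would presumably pass nltk.sent_tokenize as the tokenizer instead.

```python
def first_sentences(paragraphs, tokenize):
    """Return the first sentence of each non-blank paragraph.

    `tokenize` is any callable that maps a string to a list of sentences.
    """
    return [sents[0] for sents in map(tokenize, paragraphs) if sents]


def naive_tokenize(paragraph):
    # Stand-in splitter (an assumption, not NLTK): it cuts at every period,
    # which is exactly the approach that breaks on abbreviations.
    return [s.strip() + '.' for s in paragraph.split('.') if s.strip()]


# Hypothetical sample paragraphs, including a blank one.
paragraphs = ["Dr. Smith arrived. He sat down.\n", "\n", "It rained. We left.\n"]
result = first_sentences(paragraphs, naive_tokenize)
print(result)
```

With the naive splitter, the first paragraph comes back as just "Dr.", which illustrates the abbreviation problem described above.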
