I have this line of code:

bitext = [[sentence.strip().split() 
           for sentence in pair if len(sentence) < 100] 
          for pair in zip(open(c_data), open(e_data))[:opts.num_sents]]

c_data is a file of Chinese sentences and e_data is a file of English sentences.
bitext should be a list of Chinese/English sentence pairs, where the two sentences in each pair are translations of one another.

Since both data files are huge, I want to cut down the work my code does by only considering sentences under a certain length. The length is measured in characters.

As an example, I've specified the length here as 100. opts.num_sents is a variable that states how many sentences from the data files should be taken into consideration.

The problem/bug
If a Chinese sentence is, say, 95 characters long and its English counterpart is 105 characters long, only the Chinese sentence is added to bitext.
But I want the code to add a pair of sentences only if both of them are under the stated length.
How do I do this?

  • I am sorry, but your question is hard to understand, what exactly are you trying to do? Your question title does not seem to make any sense in relation to your question. Commented Mar 4, 2013 at 10:24
  • This isn't an if statement within a for loop - it's a list comprehension. Commented Mar 4, 2013 at 10:26
  • Forget the title, I didn't know that this is called a list comprehension. My question is about dealing with pairs where one of them satisfies len(sentence) < 100 but the other doesn't. Commented Mar 4, 2013 at 10:44

3 Answers

It's time to break up this one-liner:

def tokenize(sentence):
    return sentence.strip().split()

def sentence_pairs(c_data, e_data):
    for chinese, english in zip(open(c_data), open(e_data))[:opts.num_sents]:
        if len(chinese) < 100 and len(english) < 100:
            yield tokenize(chinese), tokenize(english)

The yield keyword turns sentence_pairs into a generator. If you only iterate over the results, it's a simpler way of writing:

def sentence_pairs(c_data, e_data):
    results = []

    for chinese, english in zip(open(c_data), open(e_data))[:opts.num_sents]:
        if len(chinese) < 100 and len(english) < 100:
            results.append((tokenize(chinese), tokenize(english)))

    return results
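
Note that slicing the result of zip() only works on Python 2, where zip() returns a list; on Python 3 it returns a lazy iterator and the [:opts.num_sents] slice raises a TypeError. Here is a rough sketch of the same generator for Python 3 using itertools.islice. The parameters max_pairs and max_len are names I've made up to stand in for opts.num_sents and the hard-coded 100, and the function takes any two iterables of lines (such as open file objects) instead of file names, so it can be tried without the real data.

```python
from itertools import islice


def tokenize(sentence):
    return sentence.strip().split()


def sentence_pairs(c_lines, e_lines, max_pairs, max_len=100):
    """Yield tokenized (chinese, english) pairs where both lines are short enough."""
    # islice takes the first max_pairs raw pairs, like [:opts.num_sents] did.
    for chinese, english in islice(zip(c_lines, e_lines), max_pairs):
        if len(chinese) < max_len and len(english) < max_len:
            yield tokenize(chinese), tokenize(english)


# In-memory stand-ins for the two data files (hypothetical sample data).
chinese_lines = ["你 好\n", "短 句\n", "再 见\n"]
english_lines = ["hello\n", "x" * 105 + "\n", "good bye\n"]

# The second pair is dropped as a whole because its English line is too long.
bitext = list(sentence_pairs(chinese_lines, english_lines, max_pairs=3))
```

With the real files you would pass the open file objects, for example sentence_pairs(open(c_data), open(e_data), opts.num_sents). Keep in mind that lines read from a file include the trailing newline, so len() counts it too.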

3 Comments

You should also consider adding an explanation of why this works.
If the OP doesn't understand list comprehensions (based on the question title), I assume he doesn't get generators either. Anyway, I upvoted :)
@Dhara: Thanks, I forgot about the generator.

First of all, rewrite your code so that it's understandable! List comprehensions are great, but when they disappear over the end of the page they get very difficult to understand.

bitext = [[sentence.strip().split() for sentence in pair if len(sentence) < 100] for pair in zip(open(c_data), open(e_data)) [:opts.num_sents]]

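is essentially the same as the loop below, except that the loop ends with one extra check so that a pair is kept only when both sentences pass the length test: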

bitext = []
for i, pair in enumerate(zip(open(c_data), open(e_data))):
    if i < opts.num_sents:
        sentence_pair = []
        for sentence in pair:
            if len(sentence) < 100:
                sentence_pair.append(sentence.strip().split())
        if len(sentence_pair) > 1:  # ie both sentences are < 100
            bitext.append(sentence_pair)

Now, you want to add sentences with a length > 100. You can see that the line

if len(sentence) < 100:

is preventing that, so change the 100.

4 Comments

I think the question is not to add sentences with length>100. The OP is trying to ask how to add a pair of sentences to the list if they both meet the criteria of len<100
The point is that I always want to add either both sentences or none (they are translations of each other, so it wouldn't make sense to add only one). If the length of one of the sentences is < 100 and the length of the other is > 100, the one with length < 100 should not be added, although it conforms to the code. Did I make myself clearer now?
Not really. The code above will add both sentences if they conform or none (ignore Dhara's comment).
Yes, thanks for your help! This is what I was looking for. I just got a bit confused by Dhara's comment.

I think what you are trying to do is maybe this:

bitext = [[sentence.strip().split() for sentence in pair] 
  for pair in zip(open(c_data), open(e_data))[:opts.num_sents] if all(len(s) < 100 for s in pair)]

This is very ugly as a list comprehension, so I recommend you use one of the other methods suggested here.
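
If you do want to stay with a comprehension, pulling the check into a named helper makes it a little easier to read. This is only a sketch: both_short is a name I've invented, the two lists stand in for the data files, and the slice on opts.num_sents is left out to keep it short.

```python
def both_short(pair, max_len=100):
    """True only when every sentence in the pair is under max_len characters."""
    return all(len(s) < max_len for s in pair)


# Hypothetical sample data in place of open(c_data) and open(e_data).
c_lines = ["短 句 一\n", "短 句 二\n"]
e_lines = ["short one\n", "y" * 120 + "\n"]

bitext = [
    [sentence.strip().split() for sentence in pair]
    for pair in zip(c_lines, e_lines)
    if both_short(pair)  # filter on the pair, not on each sentence
]
```

The key difference from the original one-liner is where the condition sits: it filters whole pairs in the outer loop, so a pair is either kept complete or skipped entirely.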

1 Comment

Yes, this is what I meant! But with if all(len(s) < 100 for s in pair) it adds empty lists. Do you know how I can stop this from happening?
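
One possible cause, and this is a guess since the data files aren't shown: a blank line consists of just a newline character, so it passes the length check, and "\n".strip().split() returns an empty list. If that is what's happening, also requiring every sentence to be non-blank would filter those pairs out. A minimal sketch, where keep_pair is a hypothetical helper and the lists stand in for the files:

```python
def keep_pair(pair, max_len=100):
    """Keep a pair only if every sentence is short enough and not blank."""
    return all(len(s) < max_len and s.strip() for s in pair)


# Hypothetical sample data; the second Chinese line is blank.
c_lines = ["你 好\n", "\n", "再 见\n"]
e_lines = ["hello\n", "orphan line\n", "good bye\n"]

bitext = [
    [sentence.strip().split() for sentence in pair]
    for pair in zip(c_lines, e_lines)
    if keep_pair(pair)
]
```

Here the middle pair is dropped because its Chinese side is blank, so no empty token list ends up in bitext.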
