Python if-statement within list comprehension

Question

I have this line of code:

bitext = [[sentence.strip().split() 
           for sentence in pair if len(sentence) < 100] 
          for pair in zip(open(c_data), open(e_data))[:opts.num_sents]]

c_data is a file with Chinese sentences.
e_data is a file with English sentences.
bitext should be a list that contains pairs of English and Chinese sentences, which are translations of one another.

Since both data files are huge, I want to reduce the complexity of my code by only taking into consideration sentences that are under a certain length. The length is measured in characters.

As an example, I've specified the length here as 100. opts.num_sents is a variable that states how many sentences from the data files should be taken into consideration.

The problem/bug
If a Chinese sentence is, say, 95 characters long and the corresponding English sentence is 105 characters long, only the Chinese sentence is added to bitext.
But I want the code to add a pair of sentences only if both of them are under the stated length.
How do I do this?
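
A minimal reproduction of the behaviour described, as a sketch only: it assumes in-memory lists in place of the two files (chinese_lines and english_lines are hypothetical stand-ins, with romanized placeholder text) and leaves out the opts.num_sents slice.

```python
# Hypothetical stand-ins for the lines read from c_data and e_data.
chinese_lines = ["ni hao shijie\n", "zai jian\n"]
english_lines = ["hello world\n", "x" * 105 + "\n"]  # second line is too long

# Same shape as the comprehension in the question:
# the length filter is applied to each sentence separately.
bitext = [[sentence.strip().split()
           for sentence in pair if len(sentence) < 100]
          for pair in zip(chinese_lines, english_lines)]

for entry in bitext:
    print(len(entry), entry)
```

The second entry contains only one sentence, because the inner filter drops the long English line but keeps its short Chinese partner.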

I am sorry, but your question is hard to understand. What exactly are you trying to do? Your question title does not seem to make any sense in relation to your question.
– Inbar Rose, Commented Mar 4, 2013 at 10:24
This isn't an if statement within a for loop; it's a list comprehension.
– Gareth Latty, Commented Mar 4, 2013 at 10:26
Forget the title, I didn't know that this is called a list comprehension. My question is about dealing with pairs where one of them satisfies len(sentence) < 100 but the other doesn't.
– Johanna, Commented Mar 4, 2013 at 10:44

Blender · Accepted Answer · 2013-03-04 10:43:57Z

2

It's time to break up this one-liner:

def tokenize(sentence):
    return sentence.strip().split()

def sentence_pairs(c_data, e_data):
    # Slicing zip(...) relies on Python 2, where zip returns a list;
    # on Python 3, wrap the zip in itertools.islice instead.
    for chinese, english in zip(open(c_data), open(e_data))[:opts.num_sents]:
        if len(chinese) < 100 and len(english) < 100:
            yield tokenize(chinese), tokenize(english)

The yield keyword turns sentence_pairs into a generator. If you only iterate over the results, it's a simpler way of writing:

def sentence_pairs(c_data, e_data):
    results = []

    for chinese, english in zip(open(c_data), open(e_data))[:opts.num_sents]:
        if len(chinese) < 100 and len(english) < 100:
            results.append((tokenize(chinese), tokenize(english)))

    return results
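
For anyone who wants to try the generator without the data files, here is a rough sketch. It assumes Python 3, where zip is lazy and cannot be sliced, so itertools.islice takes the first pairs instead. The parameters chinese_lines, english_lines, num_sents and max_len are illustrative names, not part of the original code.

```python
from itertools import islice

def tokenize(sentence):
    return sentence.strip().split()

def sentence_pairs(chinese_lines, english_lines, num_sents, max_len=100):
    """Yield tokenized pairs in which both sentences are shorter than max_len."""
    # islice lazily takes the first num_sents pairs from the zipped iterables.
    for chinese, english in islice(zip(chinese_lines, english_lines), num_sents):
        if len(chinese) < max_len and len(english) < max_len:
            yield tokenize(chinese), tokenize(english)

# Hypothetical sample data; any iterables of lines (e.g. open files) would do.
chinese_lines = ["ni hao\n", "zai jian\n", "xie xie\n"]
english_lines = ["hello\n", "y" * 105 + "\n", "thank you\n"]

# list() consumes the generator, giving the same result as the results.append version.
pairs = list(sentence_pairs(chinese_lines, english_lines, num_sents=3))
```

The middle pair is skipped entirely because its English side is too long, so only complete pairs come out.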

edited Mar 4, 2013 at 10:43

answered Mar 4, 2013 at 10:29

Blender


3 Comments

Dhara Over a year ago

You should also consider adding an explanation of why this works.

Dhara Over a year ago

If the OP doesn't understand list-comprehension (based on the Question title), I assume he doesn't get generators either. anyway I upvoted :)

Blender Over a year ago

@Dhara: Thanks, I forgot about the generator.

danodonovan · Accepted Answer · 2013-03-04 10:35:27Z

1

First of all, rewrite your code so that it's understandable! List comprehensions are great, but when they disappear over the end of the page they get very difficult to understand.

bitext = [[sentence.strip().split() for sentence in pair if len(sentence) < 100] for pair in zip(open(c_data), open(e_data)) [:opts.num_sents]]

is the same (essentially) as

bitext = []
for i, pair in enumerate(zip(open(c_data), open(e_data))):
    if i < opts.num_sents:
        sentence_pair = []
        for sentence in pair:
            if len(sentence) < 100:
                sentence_pair.append(sentence.strip().split())
        if len(sentence_pair) > 1:  # ie both sentences are < 100
            bitext.append(sentence_pair)

Now, you want to add sentences with a length > 100. You can see that the line

if len(sentence) < 100:

is preventing that, so change the 100.

edited Mar 4, 2013 at 10:35

answered Mar 4, 2013 at 10:27

danodonovan


4 Comments

Dhara Over a year ago

I think the question is not to add sentences with length>100. The OP is trying to ask how to add a pair of sentences to the list if they both meet the criteria of len<100

Johanna Over a year ago

The point is that I always want to add either both sentences or none (they are translations of each other, so it wouldn't make sense to add only one). If the length of one of the sentences is <100 and the length of the other is >100, the one with length <100 should not be added, although it conforms to the code. Did I make myself clearer now?

danodonovan Over a year ago

Not really. The code above will add both sentences if they conform or none (ignore Dhara's comment).

Johanna Over a year ago

Yes, thanks for your help! This is what I was looking for. I just got a bit confused by Dhara's comment.

Inbar Rose · Accepted Answer · 2013-03-04 12:26:23Z

1

I think what you are trying to do is maybe this:

bitext = [[sentence.strip().split() for sentence in pair] 
  for pair in zip(open(c_data), open(e_data))[:opts.num_sents] if all(len(s) < 100 for s in pair)]

This is very ugly as a list comprehension, so I recommend you use one of the other methods suggested here.
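
To illustrate where the filter sits, here is a small sketch of the same comprehension run on in-memory data. It assumes Python 3 (so itertools.islice replaces the slice on zip), and the names chinese_lines, english_lines and num_sents are made up for the example.

```python
from itertools import islice

# Hypothetical sample lines standing in for the two files.
chinese_lines = ["ni hao\n", "zao shang hao\n"]
english_lines = ["hello\n", "z" * 120 + "\n"]  # second English line is too long
num_sents = 2

# The "if all(...)" clause belongs to the outer loop, so it accepts or
# rejects a whole pair before any sentence in it is tokenized.
bitext = [[sentence.strip().split() for sentence in pair]
          for pair in islice(zip(chinese_lines, english_lines), num_sents)
          if all(len(s) < 100 for s in pair)]
```

Because the condition is on the outer loop rather than the inner one, a pair is either kept with both sentences or dropped completely.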

edited Mar 4, 2013 at 12:26

answered Mar 4, 2013 at 10:30

Inbar Rose


1 Comment

Johanna Over a year ago

Yes this is what I meant! But it adds empty lists when if all(len(s) < 100 for s in pair), do you know how I can stop this from happening?
