Delete strings from an array when the string matches part of a sentence - Perl

Question

I'm matching multiple patterns in a string to populate an array. The input file looks like this:

I love cat [chats;chaton;chatterie] and rabbit [lapins;lapereau] # J'aime les chats et les lapins # 2.8
My father [père;parent;papa] lives in New-York # Mon père vit à New-York     # 1.8

I use this code:

use strict;
use warnings;
use Data::Dump;

open(TEXT, "<", "$ARGV[0]") 
    or die "cannot open < $ARGV[0]: $!";

while(my $text = <TEXT>)
{
    my @lines = split /\n/, $text;

    foreach my $line (@lines) {
        if ($line =~ /(^(.+)\t(.+)\t(.+)$)/){
            my $english_sentence = $2;
            my $french_sentence = $3;
            my $score = $4;

            print $english_sentence."#".$french_sentence."";

            my @data = map [ split /;/ ], $line =~ / \[ ( [^\[\]]+ ) \] /xg;
            dd \@data;
        }   
        print "\n";
    }
}
close TEXT;

Here is the output:

I love cat [chats;chaton;chatterie] and rabbit [lapins;lapereau] # J'aime les chats et les lapins
Array==>[["chats", "chaton", "chatterie"], ["lapins", "lapereau"]]

My father [père;parent;papa] lives in New-York # Mon père vit à New-York
Array==>[["père", "parent", "papa"]]

I need to delete the strings in the array when this string match with a part of the sentence. Finally, I'd like to have these results:

 I love cat [chats;chaton;chatterie] and rabbit [lapins;lapereau] # J'aime les chats et les lapins
 [["chats"], ["lapins"]]

 My father [père;parent;papa] lives in New-York # Mon père vit à New-York
 [["père"]]

Re "I need to delete the strings in the array when this string match with a part of the sentence.": your output seems to show you doing exactly the opposite?
– ikegami, Commented Nov 21, 2014 at 21:08
1. For every array, create a hash whose keys are the array values. (The values of the hash elements don't matter.)
2. Split the sentence into words.
3. For every word, for every hash, delete the word from the hash.
4. For each hash, create an array from the keys of the hash.
– ikegami, Commented Nov 21, 2014 at 21:12
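
The four steps in the comment above could be sketched roughly as follows. This is only an illustration: the helper name remove_sentence_words is made up here, and the sentence is assumed to split cleanly on whitespace. Note that it follows the literal wording of the question (matching words are deleted), which is the opposite of the desired output shown in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): remove from each word
# list every word that occurs in the sentence, following the four steps.
sub remove_sentence_words {
    my ($sentence, @lists) = @_;

    # 1. One hash per list; only the keys matter.
    my @hashes = map { my %h; @h{@$_} = (); \%h } @lists;

    # 2. Split the sentence into words (assumed whitespace-separated).
    my @words = split ' ', $sentence;

    # 3. Delete every sentence word from every hash (hash-slice delete).
    for my $hash (@hashes) {
        delete @{$hash}{@words};
    }

    # 4. Turn each hash back into an array of its remaining keys.
    #    Hash order is not preserved, so the keys are sorted here.
    return map { [ sort keys %$_ ] } @hashes;
}

my @result = remove_sentence_words(
    "J'aime les chats et les lapins",
    [qw( chats chaton chatterie )],
    [qw( lapins lapereau )],
);
print join(';', @$_), "\n" for @result;
# prints:
# chaton;chatterie
# lapereau
```

To keep the matching words instead, step 3 would have to collect the words that exist in each hash rather than delete them.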

Borodin · Accepted Answer · 2014-11-22 17:11:25Z

1

This will do as you ask. It just uses grep with a regex to reduce each list to only those words that appear in the French sentence.

use utf8;
use strict;
use warnings;
use 5.010;
use autodie;

use open qw/ :std :encoding(UTF-8) /;

use Data::Dump;

open my $fh, '<', 'sentences.txt';

while (<$fh>) {

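  # Fields are: English sentence, French sentence, score
  my @sentences = split /\s*#\s*/;
  next unless @sentences == 3;

  print join(' # ', @sentences[0,1]), "\n";

  # Each [a;b;c] group in the English sentence becomes an array of words
  my @data = map [ split /;/ ], $sentences[0] =~ / \[ ( [^\[\]]+ ) \] /xg;
  # Keep only the words that occur as whole words in the French sentence
  $_ = [ grep { $sentences[1] =~ /\b\Q$_\E\b/ } @$_ ] for @data;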

  dd \@data;
  print "\n";
}

output

I love cat [chats;chaton;chatterie] and rabbit [lapins;lapereau] # J'aime les chats et les lapins
[["chats"], ["lapins"]]

My father [père;parent;papa] lives in New-York # Mon père vit à New-York
[["p\xE8re"]]
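
To illustrate why the grep pattern uses \b and \Q...\E, here is a small sketch. The helper name words_in_sentence and the sample candidates are assumptions made for this example, not part of the answer.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the candidates that occur as whole words
# in $sentence. \Q...\E quotes regex metacharacters in the candidate, and
# \b anchors the match at word boundaries.
sub words_in_sentence {
    my ($sentence, @candidates) = @_;
    return grep { $sentence =~ /\b\Q$_\E\b/ } @candidates;
}

my $french = "J'aime les chats et les lapins";

# "chat" is a substring of "chats" but not a whole word, so it is dropped.
my @kept = words_in_sentence($french, qw( chat chats lapin lapins ));
print "@kept\n";     # chats lapins

# A plain substring test, for comparison, accepts "chat" and "lapin" too.
my @loose = grep { index($french, $_) >= 0 } qw( chat chats lapin lapins );
print "@loose\n";    # chat chats lapin lapins
```

For words that begin or end with an accented letter, \b only behaves as expected on decoded character strings, which is presumably why the answer enables use utf8 and a UTF-8 I/O layer.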

Update

As requested, this code will modify the word lists in-place so that they contain only words that appear in the translation.

use utf8;
use strict;
use warnings;
use 5.010;
use autodie;

use open qw/ :std :encoding(UTF-8) /;

open my $fh, '<', 'sentences.txt';

while (<$fh>) {

  my @sentences = split /\s*#\s*/;
  next unless @sentences == 3;

  print join(' # ', @sentences[0,1]), "\n";

  # Rewrite each [a;b;c] list in place; /e evaluates the replacement as Perl code
  $sentences[0] =~ s{ \[ ( [^\[\]]+ ) \] }{
    my @words = split /;/, $1;
    @words = grep { $sentences[1] =~ /\b\Q$_\E\b/ } @words;
    sprintf "[%s]", join ';', @words;
  }exg;

  print join(' # ', @sentences[0,1]), "\n\n";

}

output

I love cat [chats;chaton;chatterie] and rabbit [lapins;lapereau] # J'aime les chats et les lapins
I love cat [chats] and rabbit [lapins] # J'aime les chats et les lapins

My father [père;parent;papa] lives in New-York # Mon père vit à New-York
My father [père] lives in New-York # Mon père vit à New-York
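
The same substitution can be wrapped in a small function that returns the filtered sentence. This is a sketch only: the name filter_lists and the sample input are invented for illustration. It also shows one edge case, namely that a list with no matching word is left as empty brackets.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $english in which every [a;b;c]
# list keeps only the words that occur as whole words in $french.
sub filter_lists {
    my ($english, $french) = @_;
    $english =~ s{ \[ ( [^\[\]]+ ) \] }{
        my @words = split /;/, $1;
        @words = grep { $french =~ /\b\Q$_\E\b/ } @words;
        '[' . join(';', @words) . ']';
    }exg;
    return $english;
}

print filter_lists(
    'I love cat [chats;chaton] and dog [chien;toutou]',
    "J'aime les chats",
), "\n";
# prints: I love cat [chats] and dog []
```

Because $english is a copy of the argument, the caller's string is not modified.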

edited Nov 22, 2014 at 17:11

answered Nov 22, 2014 at 4:17

Borodin


2 Comments

Chester Mc Allister Over a year ago

It works well. Do you think I could get this output directly? My father ["père"] lives in New-York # Mon père vit à New-York

Borodin Over a year ago

@ChesterMcAllister: I've added to my solution. It would be much more encouraging if you tried to make these changes yourself. Unlike forums where you can expect a customised response, Stack Overflow considers you the least important reader of its solutions.

user557597 · Answer · 2014-11-23 20:43:04Z

0

You can also do this by creating a hash of the French sentence's words.
This might be quicker since it avoids a third regex.

use strict;
use warnings;

while (<DATA>) {
    my ($English, $French, $repl, %FrWords);
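    # Capture the first two '#'-separated fields; the trailing score is ignored
    if ( ($English, $French) = m/^([^#]*)\#([^#]*)\#/ ) {
        # Hash slice: one key per whitespace-separated French word
        @FrWords{ split /\h+/, $French } = undef;
        # Rewrite each [a;b;c] list, keeping only the words found in the hash
        $English =~ s{ \[ ([^\[\]]*) \] }{
                 $repl = join( ';', grep { exists $FrWords{$_} } split /;/, $1 );
                 '[' . $repl . ']';
            }xeg;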
        print $English, '#', $French, "\n";
    }
}
__DATA__

I love cat [chats;chaton;chatterie] and rabbit [lapins;lapereau] # J'aime les chats et les lapins # 2.8
My father [père;parent;papa] lives in New-York # Mon père vit à New-York     # 1.8

Output

I love cat [chats] and rabbit [lapins] # J'aime les chats et les lapins 
My father [père] lives in New-York # Mon père vit à New-York

edited Nov 23, 2014 at 20:43

answered Nov 23, 2014 at 20:22

user557597

2 Comments

Chester Mc Allister Over a year ago

It works well for my sample data, but in my complete file one word can correspond to two or more words. For example: younger==>plus jeune

user557597 Over a year ago

In practice, the code does this:

Younger [plus;jeune] father [père;parent;papa] # plus jeune père    # 1.8  ==> Younger [plus;jeune] father [père] # plus jeune père

The real problem, though, is determining where words start and end. Unless you get a handle on natural language, you have little hope of making this a success.
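
As a rough illustration of the multi-word case, the sketch below compares the two lookups from the answers on a phrase. It assumes (this is not stated in the thread) that the list entry stores the whole phrase "plus jeune" as one item; the sample sentence and candidates are invented.

```perl
use strict;
use warnings;

# Assumed data: the list entry holds the whole phrase "plus jeune".
my $french     = 'Mon papa est plus jeune';
my @candidates = ('plus jeune', 'cadet');

# Hash of single words, as in the answer above: a two-word key never exists.
my %fr_words;
@fr_words{ split ' ', $french } = ();
my @by_hash = grep { exists $fr_words{$_} } @candidates;

# Whole-phrase regex test, as in the accepted answer: the phrase is found.
my @by_regex = grep { $french =~ /\b\Q$_\E\b/ } @candidates;

print scalar(@by_hash), " match(es) by hash lookup\n";    # 0 match(es) by hash lookup
print "regex match: $_\n" for @by_regex;                   # regex match: plus jeune
```

This only helps when the phrase appears in the sentence exactly as stored, with the same spacing and word order. It does not address the broader point about word boundaries in natural language.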
