I'm matching multiple patterns in a string to populate an array. The input file looks like this:

I love cat [chats;chaton;chatterie] and rabbit [lapins;lapereau] # J'aime les chats et les lapins # 2.8
My father [père;parent;papa] lives in New-York # Mon père vit à New-York     # 1.8

I use this code:

use strict;
use warnings;
use Data::Dump;

open(TEXT, "<", "$ARGV[0]") 
    or die "cannot open < $ARGV[0]: $!";

while(my $text = <TEXT>)
{
    my @lines = split /\n/, $text;

    foreach my $line (@lines) {
        if ($line =~ /(^(.+)\t(.+)\t(.+)$)/){
            my $english_sentence = $2;
            my $french_sentence = $3;
            my $score = $4;

            print $english_sentence."#".$french_sentence."";

            my @data = map [ split /;/ ], $line =~ / \[ ( [^\[\]]+ ) \] /xg;
            dd \@data;
        }   
        print "\n";
    }
}
close TEXT;

Here is the output:

I love cat [chats;chaton;chatterie] and rabbit [lapins;lapereau] # J'aime les chats et les lapins
Array==>[["chats", "chaton", "chatterie"], ["lapins", "lapereau"]]

My father [père;parent;papa] lives in New-York # Mon père vit à New-York
Array==>[["père", "parent", "papa"]]

I need to delete the strings in the array when a string matches part of the sentence. In the end, I'd like to get this result:

 I love cat [chats;chaton;chatterie] and rabbit [lapins;lapereau] # J'aime les chats et les lapins
 [["chats"], ["lapins"]]

 My father [père;parent;papa] lives in New-York # Mon père vit à New-York
 [["père"]]
2 Comments
  • Re "I need to delete the strings in the array": your output seems to show you doing exactly the opposite? Commented Nov 21, 2014 at 21:08
  • 1. For every array, create a hash whose keys are the array values. (The values of the hash elements don't matter.) 2. Split the sentence into words. 3. For every word, for every hash, delete the word from the hash. 4. For each hash, create an array from the keys of the hash. Commented Nov 21, 2014 at 21:12
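
The four steps in the comment above could be sketched roughly as follows. This is only an illustration: `delete_matching` is a made-up helper name, the sentence is split on whitespace (an assumption, since the comment does not say how to split), and the result lists are sorted because hash keys come back in no particular order. Note that it follows the literal "delete" wording, so it removes the words that occur in the sentence, which is the opposite of the expected output shown in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): remove from each candidate
# list every word that occurs in the sentence.
sub delete_matching {
    my ($sentence, @lists) = @_;

    # 1. One hash per list, keyed by the list's words (values are unused).
    my @sets = map { my %h; @h{@$_} = (); \%h } @lists;

    # 2. Split the sentence into whitespace-separated words.
    my @words = split ' ', $sentence;

    # 3. Delete every sentence word from every hash.
    for my $word (@words) {
        delete $_->{$word} for @sets;
    }

    # 4. Turn each hash back into an array (sorted for a stable order).
    return map { [ sort keys %$_ ] } @sets;
}

my @result = delete_matching(
    "J'aime les chats et les lapins",
    [qw(chats chaton chatterie)],
    [qw(lapins lapereau)],
);
# @result now holds [ 'chaton', 'chatterie' ] and [ 'lapereau' ]
```

To keep the matching words instead, step 3 would have to collect the words that are found rather than delete them.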

2 Answers

1

This will do as you ask. It just uses grep with a regex to reduce each list to only those words that appear in the French sentence.

use utf8;
use strict;
use warnings;
use 5.010;
use autodie;

use open qw/ :std :encoding(UTF-8) /;

use Data::Dump;

open my $fh, '<', 'sentences.txt';

while (<$fh>) {

  my @sentences = split /\s*#\s*/;
  next unless @sentences == 3;

  print join(' # ', @sentences[0,1]), "\n";

  my @data = map [ split /;/ ], $sentences[0] =~ / \[ ( [^\[\]]+ ) \] /xg;
  $_ = [ grep { $sentences[1] =~ /\b\Q$_\E\b/ } @$_ ] for @data;

  dd \@data;
  print "\n";
}

output

I love cat [chats;chaton;chatterie] and rabbit [lapins;lapereau] # J'aime les chats et les lapins
[["chats"], ["lapins"]]

My father [père;parent;papa] lives in New-York # Mon père vit à New-York
[["p\xE8re"]]

Update

As requested, this code will modify the word lists in-place so that they contain only words that appear in the translation.

use utf8;
use strict;
use warnings;
use 5.010;
use autodie;

use open qw/ :std :utf8 /;

open my $fh, '<', 'sentences.txt';

while (<$fh>) {

  my @sentences = split /\s*#\s*/;
  next unless @sentences == 3;

  print join(' # ', @sentences[0,1]), "\n";

  $sentences[0] =~ s{ \[ ( [^\[\]]+ ) \] }{
    my @words = split /;/, $1;
    @words = grep { $sentences[1] =~ /\b\Q$_\E\b/ } @words;
    sprintf "[%s]", join ';', @words;
  }exg;

  print join(' # ', @sentences[0,1]), "\n\n";

}

output

I love cat [chats;chaton;chatterie] and rabbit [lapins;lapereau] # J'aime les chats et les lapins
I love cat [chats] and rabbit [lapins] # J'aime les chats et les lapins

My father [père;parent;papa] lives in New-York # Mon père vit à New-York
My father [père] lives in New-York # Mon père vit à New-York

2 Comments

It works well. Do you think I could get this output directly: My father ["père"] lives in New-York # Mon père vit à New-York
@ChesterMcAllister: I've added to my solution. It would be much more encouraging if you would try to make these changes yourself. Unlike forums where you can expect a customised response, Stack Overflow considers you the least important reader of its solutions.
0

You can also do this by creating a hash of the French sentence's words.
This might be quicker since it avoids a third regex.

use strict;
use warnings;

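while (<DATA>) {
    my ($English, $French, $repl, %FrWords);
    # Capture the first two '#'-separated fields; the score is ignored.
    if ( ($English, $French) = m/^([^#]*)\#([^#]*)\#/ ) {
        # Build a lookup set of the French words (only the keys matter).
        @FrWords{ split /\h+/, $French } = ();
        # Rewrite each [a;b;c] group, keeping only words found in the set.
        $English =~ s{ \[ ([^\[\]]*) \] }{
                 $repl = join( ';', grep { exists $FrWords{$_} } split /;/, $1 );
                 '[' . $repl . ']';
            }xeg;
        print $English, '#', $French, "\n";
    }
}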
__DATA__

I love cat [chats;chaton;chatterie] and rabbit [lapins;lapereau] # J'aime les chats et les lapins # 2.8
My father [père;parent;papa] lives in New-York # Mon père vit à New-York     # 1.8

Output

I love cat [chats] and rabbit [lapins] # J'aime les chats et les lapins 
My father [père] lives in New-York # Mon père vit à New-York     

2 Comments

It works well for my sample data, but in my complete file one word can correspond to two or more words. For example: younger==>plus jeune
In reality, the code does: Younger [plus;jeune] father [père;parent;papa] # plus jeune père # 1.8 ==> Younger [plus;jeune] father [père] # plus jeune père. Your real problem, though, is determining where words start and end. Unless you get a handle on natural language, you have little hope of making this a success.
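
For the multi-word case raised in these comments, one possible workaround is to match each candidate against the whole French sentence with a regex, as the first answer does, rather than looking it up in a hash of single words. The sketch below assumes that a multi-word translation is stored as one `;`-separated entry (for example `[plus jeune]`, not `[plus;jeune]`). `filter_groups` and the sample strings are made up for illustration. It does not address the natural-language caveat above.

```perl
use strict;
use warnings;

# Hypothetical helper: in every [a;b;c] group of $english, keep only the
# candidates that occur in $french as a whole word or whole phrase.
sub filter_groups {
    my ($english, $french) = @_;

    $english =~ s{ \[ ( [^\[\]]* ) \] }{
        # Copy the candidates out of $1 before running another match.
        my @candidates = split /;/, $1;
        # \Q...\E quotes the candidate, so an embedded space matches literally.
        my @kept = grep { $french =~ /\b\Q$_\E\b/ } @candidates;
        '[' . join(';', @kept) . ']';
    }xeg;

    return $english;
}

print filter_groups(
    'Younger [plus jeune;cadet] father [parent;papa]',
    'plus jeune papa',
), "\n";
# prints: Younger [plus jeune] father [papa]
```

With accented input such as père, `\b` only behaves as intended when the strings are decoded characters, which is what the `use utf8` and `use open` lines in the first answer take care of.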