I am working on raw text from a scanned catalog.
I only want to keep two types of lines:
- lines beginning with a number (artists' works)
- lines containing two adjacent uppercase letters, possibly **with accents** (artists' names)

I would like an easy way to remove everything else (with a TRUE/FALSE test?).

My data:

ÁÀDFDS (artist 1 with accents)
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB (artist 2)
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
B'BDDED (artist 3)
az*ù*ù*ù (bad string)
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDFSF (artist 4)
6 Sed cursus augue in tempus scelerisque.
A..gdgdgdg (bad string begining with a upper case letter)
7 in commodo enim in laoreet gravida.

Expected results:

ÁÀDFDS
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
B'BDDED
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDFSF
6 Sed cursus augue in tempus scelerisque.
7 in commodo enim in laoreet gravida.

The data is imported into R with:

readLines("clipboard")

I can identify the lines that contain artist names in capital letters with a regex such as:

[A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO][A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO |']

I can identify the lines that contain artworks with:

^[0-9]+[\s]

Any help would be greatly appreciated.

3 Answers

4

Just a side note: [:upper:] matches uppercase letters as defined by the current locale (see source), so this solution is fine as long as you work with a single locale:

ll <- readLines(textConnection("ÁÀDFDS (artist 1)
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB (artist 2)
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
BBDDED (artist 3)
az*ù*ù*ù (bad string)
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDFSF (artist 4)
6 Sed cursus augue in tempus scelerisque.
...gdgdgdg (bad string)
7 in commodo enim in laoreet gravida."))
ll[grep("^[[:digit:]]+[[:blank:]]|[[:upper:]]['[:upper:]]", ll)]

See the IDEONE demo

The regex breakdown:

  • ^ - start of string
  • [[:digit:]]+ - 1 or more digits
  • [[:blank:]] - 1 space or tab
  • | - or
  • [[:upper:]]['[:upper:]] - an uppercase letter followed by ' or another uppercase letter.
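
Since the question asks about a TRUE/FALSE approach, here is a minimal sketch of the same pattern used with grepl, which returns one logical value per line. The vector name catalog and the sample lines are only illustrative, and the lines are kept ASCII-only so the result does not depend on the locale.

```r
# Illustrative sample lines (ASCII only, so [:upper:] is locale-independent here)
catalog <- c("AB (artist 2)",
             "2 Nulla sollicitudin elit in purus egestas.",
             "B'BDDED (artist 3)",
             "az*x*x (bad string)",
             "A..gdgdgdg (bad string)")

# Same pattern as above: digits followed by a blank at the start of the line,
# OR an uppercase letter followed by an apostrophe or another uppercase letter
pattern <- "^[[:digit:]]+[[:blank:]]|[[:upper:]]['[:upper:]]"

keep <- grepl(pattern, catalog)   # logical vector: TRUE = keep the line
kept <- catalog[keep]             # logical subsetting drops the FALSE lines

print(data.frame(line = catalog, keep = keep))
```

Indexing with the logical vector gives the same lines as indexing with the positions returned by grep; the logical form is handy when you want to inspect which lines are rejected.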

And here is a way to achieve what you need with a Perl-like regex:

ll[grep("^\\d+\\s|\\p{Lu}['\\p{Lu}]", ll, perl=TRUE)]

The regex matches:

  • ^ - start of string
  • \\d+\\s - 1 or more digits and then a whitespace
  • | - or...
  • \\p{Lu}['\\p{Lu}] - an uppercase Unicode letter followed by either an apostrophe or another uppercase Unicode letter.
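
As a rough sketch of how the two branches behave on accented input, the snippet below tests each branch separately and then the combined pattern. The names catalog, is_work and is_artist are made up for illustration, and the accented characters are written as \u escapes only so that the snippet does not depend on the encoding of the source file.

```r
# \u00c1 = Á, \u00c0 = À, \u00c9 = É, \u00c8 = È, \u00f9 = ù
catalog <- c("\u00c1\u00c0DFDS (artist 1)",
             "1 Lorem ipsum dolor sit amet.",
             "az*\u00f9*\u00f9*\u00f9 (bad string)",
             "\u00c9\u00c8DFSF (artist 4)",
             "A..gdgdgdg (bad string)")

# Branch 1: one or more digits and a whitespace at the start of the line
is_work <- grepl("^\\d+\\s", catalog, perl = TRUE)

# Branch 2: an uppercase Unicode letter followed by an apostrophe
# or another uppercase Unicode letter
is_artist <- grepl("\\p{Lu}['\\p{Lu}]", catalog, perl = TRUE)

# Combined pattern, equivalent to is_work | is_artist
keep <- grepl("^\\d+\\s|\\p{Lu}['\\p{Lu}]", catalog, perl = TRUE)

print(data.frame(is_work, is_artist, keep))
```

Keeping the two branches as separate logical vectors can also be useful later, for example to tell artist lines from artwork lines once the bad strings are gone.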

The output of the sample demo:

[1] "ÁÀDFDS (artist 1)"                                                     
[2] "1 Lorem ipsum dolor sit amet, consectetur adipiscing elit."            
[3] "AB (artist 2)"                                                         
[4] "2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis."
[5] "BBDDED (artist 3)"                                                     
[6] "3 Nunc et eros eget turpis sollicitudin mollis id et mi."              
[7] "4 Mauris condimentum velit eu consequat feugiat."                      
[8] "5 Suspendisse sit amet metus vitae est eleifend tincidunt."            
[9] "ÉÈDFSF (artist 4)"                                                     
[10] "6 Sed cursus augue in tempus scelerisque."                             
[11] "7 in commodo enim in laoreet gravida."    

To clean up the beginning of strings, you can use

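ll <- gsub("^[\\P{L}\\D]*?([\\p{L}\\d])", "\\1", ll, perl=TRUE)

The regex ^[\\P{L}\\D]*?([\\p{L}\\d]) lazily consumes characters from the start of the string up to the first letter or digit, which is placed into a capturing group, and the \\1 backreference in the gsub call then restores that captured alphanumeric. Note that the class [\\P{L}\\D] on its own matches any character (every character is either a non-letter or a non-digit); it is the lazy quantifier *? that makes the match stop at the first letter or digit. Use it before grepping.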
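
A small sketch of the cleanup step on lines with leading junk follows. The vector raw and its contents are invented for illustration, and the second pattern is a hypothetical simpler variant (a negated class instead of a lazy prefix plus capture group), not something taken from the answer.

```r
# Illustrative lines with leading junk; \u00c9 = É, \u00c8 = È
raw <- c("  *\u00c9\u00c8DFSF (artist 4)",
         "- 3 Nunc et eros eget turpis.",
         "AB (artist 2)")

# Pattern from the answer: lazy prefix, then the first letter or digit
# is captured and put back with \\1
cleaned <- gsub("^[\\P{L}\\D]*?([\\p{L}\\d])", "\\1", raw, perl = TRUE)

# Hypothetical simpler variant: delete a leading run of characters that are
# neither Unicode letters nor digits
cleaned2 <- sub("^[^\\p{L}\\d]+", "", raw, perl = TRUE)

print(cleaned)
print(identical(cleaned, cleaned2))
```

The two forms differ on a line that contains no letter or digit at all: the capture-group version leaves such a line unchanged, while the negated-class version reduces it to an empty string. Either way, such a line is later rejected by the grep filter.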

See IDEONE demo

6 Comments

OK! One final note: some names have an apostrophe (') in second position. How do I add that to the regex?
I think you need ll[grep("^\\d+\\s|\\p{Lu}['\\p{Lu}]", ll, perl=T)] then
Ok! One last point: I am trying to clean my text so that the numbers and letters start at the beginning of the line (removing spaces and anything that is not alphanumeric).
Then gsub("^\\W+", "", ll).
Ok, but \\W+ doesn't work with accents. Can I use ICU?
1

You can use grep:

z <- readLines("clipboard")
z[grep("^[0-9]|[[:upper:]]{2,}", z)]
 [1] "AADFDS (artist 1)"                                                     
 [2] "1 Lorem ipsum dolor sit amet, consectetur adipiscing elit."            
 [3] "AB (artist 2)"                                                         
 [4] "2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis."
 [5] "BBDDED (artist 3)"                                                     
 [6] "3 Nunc et eros eget turpis sollicitudin mollis id et mi."              
 [7] "4 Mauris condimentum velit eu consequat feugiat."                      
 [8] "5 Suspendisse sit amet metus vitae est eleifend tincidunt."            
 [9] "CCDDFSF (artist 4)"                                                    
[10] "6 Sed cursus augue in tempus scelerisque."                             
[11] "7 in commodo enim in laoreet gravida."  
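
One detail worth checking in the pattern above is that the ^ anchor belongs only to the digit branch, so the uppercase branch can match anywhere in the line. The sketch below compares three variants on made-up ASCII lines; the variable names and the "ANNEX" line are hypothetical.

```r
# Illustrative lines (ASCII only, so [:upper:] is locale-independent here)
x <- c("AB (artist 2)",
       "see also the ANNEX (bad string)",
       "B'BDDED (artist 3)")

# Pattern from the answer: the uppercase branch is not anchored
unanchored <- grepl("^[0-9]|[[:upper:]]{2,}", x)

# Anchoring the uppercase branch rejects the ANNEX line, but also the
# name with an apostrophe in second position
anchored <- grepl("^[0-9]|^[[:upper:]]{2,}", x)

# Anchored, and allowing an apostrophe as the second character
anchored_apos <- grepl("^[0-9]|^[[:upper:]]['[:upper:]]", x)

print(data.frame(line = x, unanchored, anchored, anchored_apos))
```

Whether anchoring is appropriate depends on the data: it only helps if artist names always start at the beginning of the line.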

4 Comments

I wanted to treat letters and numbers separately. Your answer helps me understand that I must think "alphanumeric". Just a note: my names have capital letters with accents. I have edited my question.
[:upper:] works on accented characters; I think the code still works with your new edit?
Ok! But the lines that I wish to delete sometimes begin with an uppercase character, whereas the artist names are written entirely in upper case.
See the edit: the condition is now that there must be at least 2 ({2,}) consecutive capital letters in the line.
1

You can use POSIX character classes if you want. However, their interpretation depends on the current locale, so a locale that is not set properly can change what a POSIX class matches.

I'd recommend turning on Perl regular expressions and using Unicode properties.

x <- readLines('clipboard')
r <- x[grepl("^\\pN+|\\p{Lu}[\\p{Lu}']", x, perl=TRUE)]

Another option is to match the accented letters explicitly with character ranges, which also avoids the POSIX classes.

r <- x[grepl("^\\d+|(?![×Þß÷þø])[A-ZÀ-ÿ][A-ZÀ-ÿ']", x, perl=TRUE)]
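
One caveat with the range-based class is that À-ÿ also covers the lowercase accented letters (à to ÿ), so two adjacent lowercase accented letters satisfy it. The sketch below illustrates this on invented strings and compares it with a hypothetical tighter class restricted to À-Ö and Ø-Ý, and with the Unicode property version. The characters are written as \u escapes only so that the snippet does not depend on the file encoding.

```r
# \u00e9 = é, \u00e8 = è (lowercase); \u00c9 = É, \u00c8 = È (uppercase)
samples <- c("caf\u00e9\u00e8 (bad string)",
             "\u00c9\u00c8DFSF (artist 4)")

# Range-based pattern from above: \u00c0-\u00ff is À-ÿ, and the lookahead
# excludes × Þ ß ÷ þ ø as the first character
range_pat <- paste0("(?![\u00d7\u00de\u00df\u00f7\u00fe\u00f8])",
                    "[A-Z\u00c0-\u00ff][A-Z\u00c0-\u00ff']")

# Hypothetical tighter class: only the uppercase letters of that block,
# À-Ö (\u00c0-\u00d6) and Ø-Ý (\u00d8-\u00dd)
upper_cls <- "A-Z\u00c0-\u00d6\u00d8-\u00dd"
tight_pat <- paste0("[", upper_cls, "][", upper_cls, "']")

by_range <- grepl(range_pat, samples, perl = TRUE)
by_tight <- grepl(tight_pat, samples, perl = TRUE)
by_prop  <- grepl("\\p{Lu}[\\p{Lu}']", samples, perl = TRUE)

print(data.frame(by_range, by_tight, by_prop))
```

On the sample data in the question this makes no difference, but on real catalog text with accented lowercase words the wider range could keep lines that should be dropped.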

You can view a compiled demo of both regular expressions in use.
