removing strings from a vector

Question

I am working on raw text data from a scanned catalog.
I only want to keep 2 types of strings:
- beginning with a number (artists' works)
- containing 2 adjacent uppercase letters, **possibly with accents** (artists' names)

I want an easy way to remove everything else (with TRUE/FALSE?)

My data

ÁÀDFDS (artist 1 with accents)
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB (artist 2)
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
B'BDDED (artist 3)
az*ù*ù*ù (bad string)
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDFSF (artist 4)
6 Sed cursus augue in tempus scelerisque.
A..gdgdgdg (bad string begining with a upper case letter)
7 in commodo enim in laoreet gravida.

Expected results

ÁÀDFDS
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB 
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
B'BDDED
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDFSF
6 Sed cursus augue in tempus scelerisque.
7 in commodo enim in laoreet gravida.

The data is imported into R with:

readLines("clipboard")

I am able to identify lines containing artist names in capital letters with various regexes, e.g.

[A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO][A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO |']

I am able to identify lines including artworks

^[0-9]+[\s]

Any help would be greatly appreciated.

Wiktor Stribiżew · Accepted Answer · 2015-11-22 21:32:52Z

4

Just a side-note: [:upper:] matches uppercase letters in the current locale (see source). Thus, this solution is good if you work with one locale:

ll <- readLines(textConnection("ÁÀDFDS (artist 1)
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB (artist 2)
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
BBDDED (artist 3)
az*ù*ù*ù (bad string)
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDFSF (artist 4)
6 Sed cursus augue in tempus scelerisque.
...gdgdgdg (bad string)
7 in commodo enim in laoreet gravida."))
ll[grep("^[[:digit:]]+[[:blank:]]|[[:upper:]]['[:upper:]]", ll)]

See the IDEONE demo

The regex breakdown:

^ - start of string
[[:digit:]]+ - 1 or more digits
[[:blank:]] - 1 space or tab
| - or
[[:upper:]]['[:upper:]] - an uppercase letter followed by ' or another uppercase letter.
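Since the question asks for a TRUE/FALSE approach, here is a minimal sketch of the same POSIX pattern used with grepl(), which returns one logical value per element. The sample vector is made up for illustration and is ASCII-only on purpose, so the result should not depend on the locale:

```r
# Hypothetical sample vector (ASCII only, not the asker's full data)
x <- c("AB (artist 2)",
       "2 Nulla sollicitudin elit",
       "az*b*c (bad string)",
       "A..gdgdgdg (bad string)")

# grepl() gives a logical vector: TRUE for lines to keep
keep <- grepl("^[[:digit:]]+[[:blank:]]|[[:upper:]]['[:upper:]]", x)
print(keep)   # TRUE TRUE FALSE FALSE

# Subset with the logical vector; x[!keep] would give the rejected lines
kept <- x[keep]
print(kept)
```

Indexing with the logical vector is equivalent to indexing with the positions returned by grep(), but it also makes it easy to inspect what was dropped via x[!keep].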

And here is a way to achieve what you need with a Perl-like regex:

ll[grep("^\\d+\\s|\\p{Lu}['\\p{Lu}]", ll, perl=T)]

The regex matches:

^ - start of string
\\d+\\s - 1 or more digits and then a whitespace
| - or...
\\p{Lu}['\\p{Lu}] - an uppercase Unicode letter followed by either an apostrophe or another uppercase Unicode letter.
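As a quick check of the Perl-like pattern on accented input, here is a small sketch. The sample strings are made up, and the accented characters are written as \u escapes (an assumption made only so the snippet does not depend on the encoding of the source file):

```r
# Hypothetical sample vector; \u escapes stand for the accented letters
x <- c("\u00c1\u00c0DFDS (artist 1)",        # "ÁÀDFDS (artist 1)"
       "B'BDDED (artist 3)",
       "az*\u00f9*\u00f9 (bad string)",      # "az*ù*ù (bad string)"
       "3 Nunc et eros")

# \p{Lu} is a Unicode uppercase letter; requires perl = TRUE
keep_perl <- grepl("^\\d+\\s|\\p{Lu}['\\p{Lu}]", x, perl = TRUE)
print(keep_perl)   # TRUE TRUE FALSE TRUE
print(x[keep_perl])
```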

The output of the sample demo:

 [1] "ÁÀDFDS (artist 1)"
 [2] "1 Lorem ipsum dolor sit amet, consectetur adipiscing elit."
 [3] "AB (artist 2)"
 [4] "2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis."
 [5] "BBDDED (artist 3)"
 [6] "3 Nunc et eros eget turpis sollicitudin mollis id et mi."
 [7] "4 Mauris condimentum velit eu consequat feugiat."
 [8] "5 Suspendisse sit amet metus vitae est eleifend tincidunt."
 [9] "ÉÈDFSF (artist 4)"
[10] "6 Sed cursus augue in tempus scelerisque."
[11] "7 in commodo enim in laoreet gravida."

To clean up the beginning of strings, you can use

ll <- gsub("^[\\P{L}\\D]*?([\\p{L}\\d])", "\\1", ll, perl=T)

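The regex ^[\\P{L}\\D]*?([\\p{L}\\d]) matches as few characters as possible from the start of the string up to the first letter or digit (which is placed into a capturing group), and the replacement restores that captured alphanumeric with the \\1 backreference. Note that the class [\\P{L}\\D] is a union of non-letters and non-digits, so it accepts any character; it is the lazy quantifier that makes the match stop at the first alphanumeric. Use it before grepping.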
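A variant of the same cleanup, as a sketch rather than the exact regex above: a negated class [^\\p{L}\\p{N}] strips every leading character that is neither a Unicode letter nor a Unicode number, without relying on \\W. The sample strings are made up, with accented letters written as \u escapes:

```r
# Hypothetical sample vector with leading junk
x <- c("  * 1 Lorem ipsum",
       "-- \u00c9\u00c8DFSF (artist 4)",     # "-- ÉÈDFSF (artist 4)"
       "AB (artist 2)")

# The pattern is anchored at ^, so sub() (first match only) is enough
cleaned <- sub("^[^\\p{L}\\p{N}]+", "", x, perl = TRUE)
print(cleaned)

# Then filter as before on the cleaned vector
result <- cleaned[grepl("^\\d+\\s|\\p{Lu}['\\p{Lu}]", cleaned, perl = TRUE)]
print(result)
```

On this sample all three lines survive the filter once the leading junk is removed.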

See IDEONE demo

edited Nov 22, 2015 at 21:32

answered Nov 22, 2015 at 20:56


6 Comments

Wilcar Over a year ago

OK! One final note, some names have an apostrophe (') in second place. How to add into the regex?

Wiktor Stribiżew Over a year ago

I think you need ll[grep("^\\d+\\s|\\p{Lu}['\\p{Lu}]", ll, perl=T)] then

Wilcar Over a year ago

Ok! One last point. I am trying to clean my text to make sure the numbers and letters are at the beginning of the line (removing spaces and anything that is not alphanumeric)

Wiktor Stribiżew Over a year ago

Then gsub("^\\W+", "", ll).

Wilcar Over a year ago

Ok, but \\W+ doesn't work with accents. Can I use ICU?

jeremycg · Accepted Answer · 2015-11-22 20:58:19Z

1

You can use grep:

z <- readLines("clipboard")
z[grep("^[0-9]|[[:upper:]]{2,}", z)]
 [1] "AADFDS (artist 1)"                                                     
 [2] "1 Lorem ipsum dolor sit amet, consectetur adipiscing elit."            
 [3] "AB (artist 2)"                                                         
 [4] "2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis."
 [5] "BBDDED (artist 3)"                                                     
 [6] "3 Nunc et eros eget turpis sollicitudin mollis id et mi."              
 [7] "4 Mauris condimentum velit eu consequat feugiat."                      
 [8] "5 Suspendisse sit amet metus vitae est eleifend tincidunt."            
 [9] "CCDDFSF (artist 4)"                                                    
[10] "6 Sed cursus augue in tempus scelerisque."                             
[11] "7 in commodo enim in laoreet gravida."

edited Nov 22, 2015 at 20:58

answered Nov 22, 2015 at 20:40


4 Comments

Wilcar Over a year ago

I wanted to treat letters and numbers separately. Your answer helps me understand that I must think "alphanumeric". Just a note: my names have capital letters with accents. I have edited my question.

jeremycg Over a year ago

[:upper:] works on Accented Characters - I think the code still works on your new edit?

Wilcar Over a year ago

Ok! but the lines that I wish to delete sometimes begin with an uppercase character when the artist names are written entirely in upper case.

jeremycg Over a year ago

See the edit - the condition is now that there must be at least 2 ({2,}) capital letters at the start

hwnd · Accepted Answer · 2015-11-23 03:59:25Z

1

You can use POSIX character classes if you want. However, their interpretation depends on the current locale and if it's not set properly, it could alter the behavior of the POSIX class.

I'd recommend turning on Perl regular expressions and using Unicode properties.

x <- readLines('clipboard')
r <- x[grepl("^\\pN+|\\p{Lu}[\\p{Lu}']", x, perl=TRUE)]

Another interesting way, if you want to stay away from POSIX classes, would be to match the accented letters by range.

r <- x[grepl("^\\d+|(?![×Þß÷þø])[A-ZÀ-ÿ][A-ZÀ-ÿ']", x, perl=TRUE)]

You can view the compiled demo of both regular expressions in use.
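One difference between the two patterns worth checking on your own data: the range À-ÿ (U+00C0 to U+00FF) also contains lowercase accented letters, while \\p{Lu} is uppercase only. A small sketch with made-up strings (accented characters written as \u escapes, and the leading number alternative left out to isolate the letter part):

```r
# Hypothetical sample vector
x <- c("\u00c9\u00c8DFSF (artist 4)",                  # "ÉÈDFSF (artist 4)"
       "caf\u00e9\u00e8 (lowercase accents only)")     # "caféè (...)"

# Unicode property: two adjacent uppercase letters (or uppercase + apostrophe)
by_property <- grepl("\\p{Lu}[\\p{Lu}']", x, perl = TRUE)

# Explicit range, excluding × Þ ß ÷ þ ø with a negative lookahead
by_range <- grepl(
  "(?![\u00d7\u00de\u00df\u00f7\u00fe\u00f8])[A-Z\u00c0-\u00ff][A-Z\u00c0-\u00ff']",
  x, perl = TRUE)

print(by_property)   # TRUE FALSE
print(by_range)      # TRUE TRUE
```

So the range version may keep lines that contain two adjacent lowercase accented letters, which may or may not matter for a given catalog.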

edited Nov 23, 2015 at 3:59

answered Nov 23, 2015 at 2:29
