Work on raw textual data from a scanned catalog.
I only want to keep 2 types of strings:
- begining with a number (artists works)
- containing 2 juxtaposed uppercases letters **with accents **(artists names)
I want easily to remove everything else (with true -false?)
my datas
ÁÀDFDS (artist 1 with accents)
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB (artist 2)
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
B'BDDED (artist 3)
az*ù*ù*ù (bad string)
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDFSF (artist 4)
6 Sed cursus augue in tempus scelerisque.
A..gdgdgdg (bad string begining with a upper case letter)
7 in commodo enim in laoreet gravida.
expected results
with accents DFDS
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
B'BDDED
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDDFSF
6 Sed cursus augue in tempus scelerisque.
7 in commodo enim in laoreet gravida.
The data is imported into R with:
readlines ("clipboard")
I am able to identify lines including artist names in capital letters with different regex
e.g.
[A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO][A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO |']
I am able to identify lines including artworks
^[0-9]+[\s]
Any help would be greatly appreciated.