I have a string that I want to remove non-alphanumeric characters from:
my_string = "¿Habla usted Inglés, por favor?"
Basically I want to get rid of the ?, ¿ and , in this case. I then split the words into a list and do various kickass things with each one.
I am using
my_string
|> String.replace(my_regex, "")
|> String.split(" ")
to do the work. I have two different regexes I'm attempting to use:
my_regex = ~r/[\_\.,:;\?¿¡\!&@$%\^]/
my_regex = ~r/[[:punct:]]/
The first one works like a charm. I end up with:
["habla", "usted", "inglés"]
The second one removes the correct characters but I end up with:
[<<194, 104, 97, 98, 108, 97>>, "usted", <<105, 110, 103, 108, 195, 115>>]
At first I thought the strange output was just the console's way of printing the non-ASCII letters. But when I try to match the result against the expected list of strings, the match fails.
Whatever the case, I don't understand why the two regexes produce different strings in the list.
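To rule out a pure display issue, a small check along these lines can be run in iex. It is only a sketch: the first element of the second list is copied byte for byte from the output above, and the names broken and expected are just illustrative.

```elixir
# First element of the list produced with ~r/[[:punct:]]/, copied from the output above.
broken = <<194, 104, 97, 98, 108, 97>>
expected = "habla"

# Is each binary valid UTF-8?
IO.inspect(String.valid?(broken))
# false
IO.inspect(String.valid?(expected))
# true

# Comparing against the expected string fails, as described.
IO.inspect(broken == expected)
# false

# The bytes of "¿" in UTF-8: the stray 194 is the first of these two bytes.
IO.inspect(:binary.bin_to_list("¿"))
# [194, 191]
```

So the result is not a valid UTF-8 string at all, which is why iex prints it as raw bytes and why the comparison fails. It looks as if only the second byte of the two-byte "¿" was removed.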
Here is code that can be run in iex to succinctly reproduce my issue:
a = ~r/[\_\.,:;\?¿¡\!&@$%\^]/
b = ~r/[[:punct:]]/
y = "¿Habla usted Inglés, por favor?"
String.replace(y, a, "")
# -> "Habla usted Inglés por favor"
String.replace(y, b, "")
# -> <<194, 72, 97, 98, 108, 97, 32, 117, 115, 116, 101, 100, 32, 73, 110, 103, 108, 195, 115, 32, 112, 111, 114, 32, 102, 97, 118, 111, 114>>