5

I have the following vector and want to replace the subscript digits (e.g. ₆, ₂) with 'normal' digits.

vec = c("C₆H₄ClNO₂", "C₆H₆N₂O₂", "C₆H₅NO₃", "C₉H₁₀O₂", "C₈H₈O₃")

I could lookup all subscript values and replace them individually:

gsub('₆', '6', vec)

But isn't there a pattern in regex for it?

There's a similar question for JavaScript, but I couldn't translate it into R.

Comments
  • chartr("₀₁₂₃₄₅₆₇₈₉", "0123456789", vec) Commented Sep 19, 2019 at 8:54
  • @WiktorStribiżew it's not really a duplicate imho, since I'm asking for a pattern in regex for sub/superscripts. But yes, this would be one possibility. Commented Sep 19, 2019 at 8:57
  • You need no regex here. chartr is from base R, use it here. Commented Sep 19, 2019 at 8:59
  • Possible duplicate of Using multiple gsubs in one r function Commented Sep 19, 2019 at 17:51
  • This question is in my opinion significantly different from the proposed duplicate, which is why I voted to undelete and reopen. Besides, chartr is way too underappreciated and deserves more of our love. Commented Sep 25, 2019 at 7:04

2 Answers

6

Use chartr:

Translate characters in character vectors

Solution:

chartr("₀₁₂₃₄₅₆₇₈₉", "0123456789", vec)

See the online R demo

BONUS

To normalize superscript digits use

chartr("⁰¹²³⁴⁵⁶⁷⁸⁹", "0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")
## => [1] "0123456789"
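
The two translations above can also be folded into a single call. The following is only a sketch: the helper name normalize_digits is hypothetical (it is not part of base R or of this answer), and it assumes the strings are UTF-8 encoded. The characters are written as \u escapes so the code does not depend on the encoding of the source file.

```r
# Hypothetical helper (the name is illustrative, not from the answer):
# maps Unicode subscript and superscript digits to ASCII digits.
normalize_digits <- function(x) {
  # Subscripts occupy U+2080..U+2089
  subs <- "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089"
  # Superscripts are not contiguous: 1, 2, 3 live at U+00B9, U+00B2, U+00B3
  sups <- "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
  # chartr maps the i-th character of `old` to the i-th character of `new`
  chartr(paste0(subs, sups), "01234567890123456789", x)
}

vec <- c("C\u2086H\u2084ClNO\u2082", "C\u2089H\u2081\u2080O\u2082")
normalize_digits(vec)
## => [1] "C6H4ClNO2" "C9H10O2"
```

Because chartr translates character by character, it does not matter that the superscript digits are scattered over several Unicode blocks.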

3

We can use str_replace_all from stringr to match each subscript digit, convert it to its integer code point, subtract 8272 (the difference between the code point of ₆ and that of 6, which is the same for every subscript digit and its ASCII equivalent), and convert the result back to a character.

stringr::str_replace_all(vec, "\\p{No}", function(m) intToUtf8(utf8ToInt(m) - 8272))
#[1] "C6H4ClNO2" "C6H6N2O2"  "C6H5NO3"   "C9H10O2"   "C8H8O3" 

As pointed out by @Wiktor Stribiżew, "\\p{No}" matches more than subscript digits. To match only the subscripts 0-9, we can use (thanks to @thothal):

str_replace_all(vec, "[\U2080-\U2089]", function(m) intToUtf8(utf8ToInt(m) - 8272))
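
The offset of 8272 can be checked in base R. This is a small sketch using only utf8ToInt and intToUtf8; the variable names are illustrative. It also shows why the broader "\\p{No}" class is risky with this arithmetic: superscript digits such as ² sit at a different distance from their ASCII counterparts.

```r
# Code points of the subscript digits (U+2080..U+2089) and of "0".."9"
sub_digits   <- utf8ToInt("\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089")
ascii_digits <- utf8ToInt("0123456789")

# The distance is the same for every digit, so a single offset is enough
offset <- unique(sub_digits - ascii_digits)
offset
## => [1] 8272

# Shifting a subscript six by the offset yields the ASCII six
intToUtf8(utf8ToInt("\u2086") - offset)
## => [1] "6"

# Superscript two (U+00B2) is only 128 code points away from "2",
# so subtracting 8272 from it would not give a digit
sup_gap <- utf8ToInt("\u00b2") - utf8ToInt("2")
sup_gap
## => [1] 128
```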

2 Comments

\p{No} matches more than subscript digits and it is not the solution OP needs.
You should replace your regex with [\U2080-\U2089]
