0

I have a large numpy array of strings. Some elements are clean strings, some have special characters (typically at the start of the string), and some contain quoted substrings (in various kinds of quotes). I want to identify the elements that contain a quoted substring, store that substring, and remove it from the original string.

Example:


my_array = ['# this is the "Sharpest" hashtag ever', 'life as we know it', '" what would you do?', 'this was an "arbitrary" result',  'what do you mean']

corrected_array = ['# this is the hashtag ever', 'life as we know it', '" what would you do?',
                   'this was an result', 'what do you mean']

As you can see, the words "Sharpest" and "arbitrary" were removed from the corrected array. Is there a way to identify such substrings and remove them from the original strings efficiently?

  • Some of the strings inside my_array are invalid, causing a syntax error. You're going to have to fix that while you build that list. Show the code for how my_array is created. Commented Aug 14, 2020 at 15:38
  • So you want to drop every string encased between quotes? As @GAEfan says, '# this is the 'Sharpest' hashtag ever' is an invalid string, so you will probably have to change the enclosing quotes of either the string or the substring. Commented Aug 14, 2020 at 15:39
  • I just noticed that. The strings are valid in each element; it was a syntax error on my end when asking the question, but the overall question stands. Commented Aug 14, 2020 at 15:50
  • Does this answer your question? How to delete the words between two delimiters? Commented Aug 14, 2020 at 15:51

3 Answers

2

Try this:

import re
# Treat single quotes as double quotes, then drop every "..." span.
corrected_array = [re.sub(r'"[^"]*"', '', s.replace("'", '"')) for s in my_array]
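The question also asks to store the removed substrings and expects single spaces in the result, which a plain substitution does not give. The following is a minimal sketch of one way to do both, assuming only double quotes delimit the substrings; the names QUOTED, split_quoted, and extracted are illustrative and not part of the original answer.

```python
import re

my_array = ['# this is the "Sharpest" hashtag ever', 'life as we know it',
            '" what would you do?', 'this was an "arbitrary" result',
            'what do you mean']

# A double-quoted span; group 1 captures the text between the quotes.
QUOTED = re.compile(r'"([^"]*)"')

def split_quoted(s):
    """Return (cleaned string, list of quoted substrings removed from it)."""
    found = QUOTED.findall(s)                 # inner text of every "..." span
    cleaned = QUOTED.sub('', s)               # drop the spans, quotes included
    cleaned = re.sub(r' {2,}', ' ', cleaned)  # collapse the gap left behind
    return cleaned, found

results = [split_quoted(s) for s in my_array]
corrected_array = [cleaned for cleaned, _ in results]
extracted = [found for _, found in results]
```

A string with a single unpaired quote, such as '" what would you do?', is left unchanged because the pattern needs both an opening and a closing quote.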

0

You can try a brute-force approach: find the index of the first " and, similarly, the index of the last quote, then exclude everything in the string between those two positions.
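A rough sketch of that index-based idea, using str.find and str.rfind. The function name strip_quoted_span is hypothetical, and the sketch assumes each string holds at most one quoted span; with more than two quotes it would treat everything from the first to the last as a single span.

```python
def strip_quoted_span(s, quote='"'):
    """Remove the text between the first and last quote (quotes included).

    Returns (cleaned string, removed inner text), or (s, None) when the
    string has fewer than two quote characters.
    """
    first = s.find(quote)
    last = s.rfind(quote)
    if first == -1 or first == last:
        return s, None                      # no quote, or a single unpaired one
    inner = s[first + 1:last]               # text between the two quotes
    return s[:first] + s[last + 1:], inner  # keep everything outside the span
```

Note that the space on each side of the removed span is kept, so the cleaned string still contains a double space at that position.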


0

You can use re.sub:

import re

[re.sub('["\']([^"]*)["\']', "", s) for s in my_array]
['# this is the  hashtag ever', 'life as we know it', '" what would you do?', 'this was an  result', 'what do you mean']
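The pattern above accepts a different quote character at each end, so a string such as 'he said "hi' there' would also lose text. One possible variant, sketched here as an assumption rather than as part of the original answer, uses a backreference so the closing quote must match the opening one. The name remove_matched_quotes is illustrative.

```python
import re

# Group 1 is the opening quote; \1 requires the same character to close it.
MATCHED = re.compile(r'''(["'])(.*?)\1''')

def remove_matched_quotes(s):
    """Drop every span whose opening and closing quote characters match."""
    return MATCHED.sub('', s)
```

Strings with two or more apostrophes (for example contractions) can still be matched by accident, so this is only appropriate when single quotes are known to act as delimiters in the data.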
