So I'm going through a text and I need to replace a bunch of CIDs (characters that were not readable when I scraped them). I need to replace every "cid:###" with the correct character. The issue I'm currently running into is that some CIDs are wrapped in <s></s>, and there is no space between <s>(cid:131)</s> and the next word.
So, when I use replace, it doesn't work when I try to replace <s>(cid:131)</s> with ▪. When I try to replace cid:131 with ▪, I get <s>▪</s>. I'm trying to get rid of the <s></s> for this specific case only (<s></s> is found in other places in the document and I don't want to replace those).
Doesn't change anything:
csv_of_table = csv_of_table.replace('<s>(cid:131)</s>', '▪', regex=True)
Only changes the part with cid:131:
csv_of_table = csv_of_table.replace('cid:131', '▪', regex=True)
▪ (seems unlikely). But you could first remove every <s> and </s> that sits around those CIDs, for example with this pattern:
<s>(?=cid:\d{3})|(?<=cid:\d{3})<\/s>
After that, you can run whatever operation you already had going to replace the CIDs.
<s></s> tags, though. Let me try that.
<s>(?=\(cid:\d{3})|(?<=\(cid:\d{3}\))<\/s> to get it to work.
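For reference, here is a minimal sketch of the two-step approach using the final pattern from the comment above, written against plain strings with Python's re module rather than pandas. The first attempt in the question most likely does nothing because, with regex=True, the unescaped parentheses form a capture group, so the pattern looks for <s>cid:131</s> instead of <s>(cid:131)</s>. The function name clean_cids, the CID_TO_CHAR mapping, and the sample string are made up for illustration; only the 131 to ▪ pair comes from the question.

```python
import re

# Matches "<s>" directly before "(cid:NNN" or "</s>" directly after "(cid:NNN)".
# Other <s>...</s> pairs in the text are not touched.
CID_TAGS = re.compile(r"<s>(?=\(cid:\d{3})|(?<=\(cid:\d{3}\))</s>")

# Hypothetical lookup table; only 131 -> "▪" is taken from the question.
CID_TO_CHAR = {"131": "▪"}


def clean_cids(text: str) -> str:
    """Strip <s></s> around CIDs, then swap each known CID for its character."""
    text = CID_TAGS.sub("", text)
    # Unknown CIDs are left as they are so nothing is lost silently.
    return re.sub(
        r"\(cid:(\d{3})\)",
        lambda m: CID_TO_CHAR.get(m.group(1), m.group(0)),
        text,
    )


sample = "<s>(cid:131)</s>First item, <s>struck out</s> text"
print(clean_cids(sample))
```

If csv_of_table is a pandas DataFrame (an assumption based on the replace(..., regex=True) calls in the question), the same tag-stripping pattern should work as csv_of_table.replace(pattern, '', regex=True) before the CID replacements are run.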