I'm going through a text and need to replace a bunch of CIDs (characters that were not readable when I scraped them): every "cid:###" should become the correct character. The issue I'm running into is that some CIDs are wrapped in <s></s>, and there is no space between <s>(cid:131)</s> and the next word.

When I use replace to turn <s>(cid:131)</s> into ▪, nothing happens. When I replace cid:131 with ▪, I get <s>▪</s>. I want to get rid of the <s></s> for this specific case only (<s></s> is found in other places in the document and I don't want to replace those).

Doesn't change anything:

csv_of_table = csv_of_table.replace('<s>(cid:131)</s>', '▪', regex=True)

Only changes the part with cid:131:

csv_of_table = csv_of_table.replace('cid:131', '▪', regex=True)
  • Not sure if all your `CID:###` need to be replaced by ▪ (seems unlikely). But you could first remove all the <s> and </s> that surround those CIDs, for example with: <s>(?=cid:\d{3})|(?<=cid:\d{3})<\/s>. After that, you can run whatever operation you already had for replacing the CIDs. Commented Feb 27, 2020 at 18:40
  • No, I have a list of CIDs I'm going through across several different fonts and characters. This is one of the only ones with the <s></s> tags, though. Let me try that. Commented Feb 27, 2020 at 19:01
  • Let me know how it went. I just noticed the parentheses, so you might want to replace the suggested pattern with <s>(?=\(cid:\d{3})|(?<=\(cid:\d{3}\))<\/s> to get it to work. Commented Feb 27, 2020 at 19:20
  • I was able to make it work with @Ben Pap's solution (though I replaced \d with 131). Commented Feb 27, 2020 at 20:01
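The two-step approach from the comments (strip the tags around a CID first, then replace the CID itself) can be sketched with Python's standard `re` module. This is only an illustration: the sample string extends the one example given in this thread, and the `cid_to_char` mapping and variable names are hypothetical.

```python
import re

# Sample text: the first part is the example from this thread; the second
# <s>...</s> pair is made up to show that unrelated tags are left alone.
text = "<s>(cid:131)</s>Marine Environment <s>struck out</s>"

# Step 1: remove <s> only when "(cid:NNN" follows it, and </s> only when
# "(cid:NNN)" precedes it. The lookbehind is fixed-width, which `re` requires.
unwrap = re.compile(r"<s>(?=\(cid:\d{3})|(?<=\(cid:\d{3}\))</s>")
unwrapped = unwrap.sub("", text)

# Step 2: run the usual per-CID replacement (hypothetical one-entry mapping).
cid_to_char = {"(cid:131)": "▪"}
for cid, char in cid_to_char.items():
    unwrapped = unwrapped.replace(cid, char)

print(unwrapped)
```

Because the lookarounds only assert context without consuming it, the CID text itself survives step 1 and is still available for the per-font mapping in step 2.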

1 Answer

You can use the ? quantifier to make a group optional, meaning it can appear zero times or once. Note that the parentheses around the CID are escaped so they match literally.

csv_of_table = csv_of_table.replace(r"(<s>\()?cid:\d+(\)<\/s>)?", "▪", regex=True)
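To see why the original attempt changed nothing, here is a minimal sketch using the standard `re` module on the sample string from the comments. It targets only cid:131, which is what the asker reported doing in the end; the variable names are illustrative. The same escaped pattern should carry over to the pandas `replace(..., regex=True)` call, though that is an assumption not tested here.

```python
import re

sample = "<s>(cid:131)</s>Marine Environment"

# Unescaped: (cid:131) is read as a capture group, so the pattern looks for
# the text "<s>cid:131</s>" (no parentheses) and finds nothing to replace.
unescaped = re.sub(r"<s>(cid:131)</s>", "▪", sample)

# Escaped: \( and \) match the literal parentheses in the scraped text.
escaped = re.sub(r"<s>\(cid:131\)</s>", "▪", sample)

# re.escape builds the escaped pattern from the literal string.
auto_escaped = re.sub(re.escape("<s>(cid:131)</s>"), "▪", sample)

print(unescaped)
print(escaped)
print(auto_escaped)
```

Matching the full `<s>\(cid:131\)</s>` sequence leaves every other `<s></s>` pair in the document untouched, since the tags are only removed when they wrap this specific CID.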

8 Comments

Are you saying that ? would take into account things like: <s><s>cid:131</s></s>? Because this is not my issue, unfortunately. And I just tried it and it only changes cid:131 to ▪. Thanks for helping, though!
Hmm, it should work. Take a look at regex101.com/r/fi2KCk/2: it fully matches both cases. Can we get the actual text from csv_of_table?
Hmm. Here's an example: <s>(cid:131)</s>Marine Environment
OP mentioned he does not want to replace just any CID with "▪". I guess if your suggestion ends up working, that's exactly what it will do.
The parentheses actually make a world of difference, especially with regex! My edit should fix this.