I have a df with some records that look like this:

Untitledp { margin-top: 0px;margin-bottom: 0px;line-height: 1.15; } body { font-family: 'Times New Roman';font-style: Normal;font-weight: normal;font-size: 13.3333333333333px; } .Normal { telerik-style-type: paragraph;telerik-style-name: Normal;border-collapse: collapse; } .TableNormal { telerik-style-type: table;telerik-style-name: TableNormal;border-collapse: collapse; } .s_F0039783 { telerik-style-type: local;font-size: 13.34px; } .s_45EBF2E0 { telerik-style-type: local;font-family: 'Times New Roman';font-size: 13.3333333333333px;color: #000000; } A sentence that I actually want.

I want to remove the CSS style blocks and return only the sentence at the end. The number of CSS blocks can differ from record to record. All records start with "Untitledp" and end with the text I want (with no style blocks after the text).

How should I clean these blocks? I use BeautifulSoup to strip HTML tags, but it doesn't handle these blocks.

1 Answer

A regex can be used for this, with re.sub():

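import re

# Raw string so that \s is not treated as a string escape.
# Both .+ and .* are greedy, so the match runs from the start to the last '}'.
regex = re.compile(r'.+\s*{.*}')
regex.sub('', s)  # s is a copy-paste of your sample
# ' A sentence that I actually want.'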

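At least it works in this example. Be careful though: the pattern is greedy, so it removes everything up to the last } in the string. If the sentence you're trying to get contains a }, the part of the sentence before that brace is removed as well. However, sentences don't typically contain these characters, so this is unlikely to be a problem. Note that the result keeps a leading space, so you may want to call .strip() on it.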
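If braces in the sentence are a concern, a stricter pattern might help. The following is only a sketch, not a tested solution for your real data: it assumes every style block looks like "selector { prop: value; ... }" with each declaration ending in a semicolon, as in the sample record. The names CSS_BLOCKS and strip_css_blocks are made up for illustration, and the sample string is a shortened version of yours with braces added to the sentence.

```python
import re

# Assumption: each style block is "selector { prop: value; ... }" and every
# declaration ends with ';', as in the sample record from the question.
# \A anchors at the start, so only the leading run of blocks is removed.
CSS_BLOCKS = re.compile(
    r"\A(?:[^{}]*\{(?:\s*[\w-]+\s*:[^;{}]*;)*\s*\})+"
)

def strip_css_blocks(text):
    """Remove the leading run of CSS-like blocks and trim whitespace."""
    return CSS_BLOCKS.sub("", text).strip()

# Shortened, hypothetical record whose sentence contains braces.
sample = (
    "Untitledp { margin-top: 0px;line-height: 1.15; } "
    "body { font-family: 'Times New Roman';font-size: 13.3333333333333px; } "
    "A sentence with {braces} that I actually want."
)
print(strip_css_blocks(sample))
# A sentence with {braces} that I actually want.
```

Because "{braces}" does not look like a list of "prop: value;" declarations, the pattern stops before it and the sentence survives intact. With pandas, something like df["text"].map(strip_css_blocks) should apply it to each record, where "text" is a hypothetical column name.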
