Cleaning and stripping of strings/HTML - Python

Question

I have a set of questions that I do not have answers to.

1) Stripping lists of string

input:
'item1,   item2, \t\t\t item3, \n\n\n \t, item4, , , item5, '

output:
['item1', 'item2', 'item3', 'item4', 'item5']

Anything more efficient than doing the following?

[x.strip() for x in l.split(',') if x.strip()]

2) Cleaning/Sanitizing HTML

keeping basic tags e.g. strong, p, br, ...

removing malicious javascript, css and divs

3) Unicode handling...

what would you recommend for dealing with unicode parsed within documents?

Any ideas? :) Thanks guys!

Please split your question into 3; that makes the questions much more helpful for people who might search for something similar.
– Georg Schölly, Commented Oct 28, 2010 at 21:35

Alex Rashkov · Accepted Answer · 2010-10-28 21:39:36Z

2

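To clean HTML, use lxml.html:

import lxml.html
# Parse the markup into an element tree ("..." stands for the HTML string)
text = lxml.html.fromstring("...")
# Return only the text content, with all tags removed
text.text_content()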

answered Oct 28, 2010 at 21:39


2 Comments

RadiantHex Over a year ago

thanks :) but it doesn't really clean/sanitize the HTML. I just need br, p, strong, italic, span elements :)

bobince Over a year ago

A good way to sanitise HTML is to parse it into a DOM, remove all the elements, attributes, and URL-schemes that aren't known-safe, and serialise back to HTML.
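The allow-list idea in the comment above can be sketched with the standard library alone. This is a minimal illustration, not a vetted sanitizer: it uses the streaming `html.parser` module rather than a full DOM, and the names `ALLOWED_TAGS`, `AllowListSanitizer`, and `sanitize` are hypothetical. The tag list is an assumption based on the elements the asker mentions. All attributes are dropped, so no URL schemes need checking, and the contents of `script` and `style` elements are discarded.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-list, based on the tags mentioned in the question.
ALLOWED_TAGS = {"p", "br", "strong", "em", "i", "span"}
# Allowed tags that never take a closing tag.
VOID_TAGS = {"br"}
# Elements whose text content is discarded entirely.
DROP_CONTENT = {"script", "style"}


class AllowListSanitizer(HTMLParser):
    """Re-emit only allow-listed tags (without attributes) and escaped text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            # Attributes are deliberately ignored, so none reach the output.
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif (self.skip_depth == 0 and tag in ALLOWED_TAGS
              and tag not in VOID_TAGS):
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Escape text so it cannot be reinterpreted as markup.
            self.out.append(escape(data))


def sanitize(html_text):
    """Return html_text with everything outside the allow-list removed."""
    parser = AllowListSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


dirty = '<p onclick="evil()">Hi <strong>there</strong><script>alert(1)</script></p>'
print(sanitize(dirty))
```

Tags outside the allow-list (such as `div`) are removed while their text is kept. The sketch does not repair mismatched or unclosed tags, so for real input a maintained sanitizing library is likely the safer choice.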

Mark Byers · Accepted Answer · 2010-10-28 21:46:50Z

2

For the first one you can use split then a list comprehension to trim the extra whitespace:

result = [x.strip() for x in i.split(',')]

And to remove the empty strings from the list:

result = [x for x in result if x]
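As a quick check, the two steps above can be run against the input from the question. This is only a sketch; the wrapper name `split_and_clean` is hypothetical and not part of the original answer.

```python
def split_and_clean(text):
    """Split on commas, trim whitespace, and drop empty items."""
    # Step 1: split on commas and strip whitespace from each piece.
    result = [x.strip() for x in text.split(',')]
    # Step 2: drop pieces that were empty or whitespace-only.
    return [x for x in result if x]


i = 'item1,   item2, \t\t\t item3, \n\n\n \t, item4, , , item5, '
print(split_and_clean(i))
# ['item1', 'item2', 'item3', 'item4', 'item5']
```

Each item is stripped exactly once here, at the cost of building an intermediate list.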

edited Oct 28, 2010 at 21:46

answered Oct 28, 2010 at 21:38


3 Comments

RadiantHex Over a year ago

It would have to be result = [x.strip() for x in i.split(',') if x.strip()], was hoping there would be a more efficient way of doing this though. Well thanks anyway

RadiantHex Over a year ago

btw [x.strip() for x in i.split(',') if x.strip()] does both at the same time :)

hughdbrown Over a year ago

@RadiantHex: ... by performing the strip twice. This answer would be better if the first operation were a generator, not a list comprehension.

hughdbrown · Accepted Answer · 2010-10-29 03:48:49Z

1

I tend to write multiple cascading generators, particularly if I want some of the output to be part of a test:

stripped_iter = (x.strip() for x in l.split(','))
non_empty_iter = (x for x in stripped_iter if x)

The inspiration is Beazley's presentation on coroutines.
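A sketch of how the cascade above might be consumed, assuming the input from the question. The function name `iter_items` is hypothetical. Nothing is stripped or filtered until the result is iterated, and each item is stripped only once.

```python
def iter_items(text):
    """Lazily yield the stripped, non-empty items of a comma-separated string."""
    # text.split(',') still builds a list up front; only the stripping
    # and filtering stages below are lazy.
    stripped_iter = (x.strip() for x in text.split(','))
    non_empty_iter = (x for x in stripped_iter if x)
    return non_empty_iter


l = 'item1,   item2, \t\t\t item3, \n\n\n \t, item4, , , item5, '
items = list(iter_items(l))  # materialize the pipeline
print(items)
# ['item1', 'item2', 'item3', 'item4', 'item5']
```

Because each stage is a separate generator, an intermediate stage can be inspected or asserted on in a test by wrapping it in `list()`.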

answered Oct 29, 2010 at 3:48


Klaus Byskov Pedersen · Accepted Answer · 2010-10-28 21:40:26Z

1

I am somewhat of a beginner at Python web development, but for cleaning/sanitizing HTML I have found that the markdown2 library has some very nice features. You can use it with the MarkItUp! jQuery-based editor. They may not solve all your problems, but they might help you do a lot of work in a short time.

answered Oct 28, 2010 at 21:40


Brandon Frohbieter · Accepted Answer · 2010-10-28 21:41:02Z

1

1) you can use the strip method

2) you can use sanitize: http://wonko.com/post/sanitize

3) some unicode tips here: http://blog.trydionel.com/2010/03/23/some-unicode-tips-for-ruby/

answered Oct 28, 2010 at 21:41


1 Comment

bobince Over a year ago

Erm... the question appears to be Python, rather than Ruby? The way the two languages handle Unicode is very, very different.

mouad · Accepted Answer · 2010-10-28 21:47:14Z

1

1) [j.strip() for j in a.split(',') if j.strip()]

2) check tidy

answered Oct 28, 2010 at 21:47

