
I have a set of questions that I do not have answers to.

1) Stripping lists of strings

input:
'item1,   item2, \t\t\t item3, \n\n\n \t, item4, , , item5, '

output:
['item1', 'item2', 'item3', 'item4', 'item5']

Anything more efficient than doing the following?

[x.strip() for x in l.split(',') if x.strip()]

2) Cleaning/Sanitizing HTML

Keeping basic tags, e.g. strong, p, br, ...

Removing malicious JavaScript, CSS and divs

3) Unicode handling...

What would you recommend for dealing with Unicode parsed within documents?


Any ideas? :) Thanks guys!

Please split your question into 3, that makes the questions much more helpful for people who might search for something similar. Commented Oct 28, 2010 at 21:35

6 Answers

2

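To clean HTML, use lxml.html:

import lxml.html
# Parse the markup ("..." stands for the HTML string to process).
doc = lxml.html.fromstring("...")
# text_content() returns the text with all markup removed.
doc.text_content()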

2 Comments

thanks :) but it doesn't really clean/sanitize the HTML. I just need br, p, strong, italic, span elements :)
A good way to sanitise HTML is to parse it into a DOM, remove all the elements, attributes, and URL-schemes that aren't known-safe, and serialise back to HTML.
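The allowlist approach described in the comment above can be sketched with the standard library alone. This is a minimal illustration, not a production sanitizer: it uses the streaming `html.parser` module rather than a real DOM, the names `ALLOWED_TAGS`, `AllowlistSanitizer` and `sanitize` are hypothetical, and the tag list is an assumption based on the elements named in the question. It drops every attribute, so it never has to inspect URL schemes.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist, assumed from the tags named in the question.
ALLOWED_TAGS = {"p", "br", "strong", "em", "i", "span"}
VOID_TAGS = {"br"}                 # tags that never get a closing tag
DROP_CONTENT = {"script", "style"}  # drop these tags and their contents


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            # Attributes are discarded entirely, which removes on* handlers,
            # style attributes and javascript: URLs in one step.
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Re-escape text so decoded entities cannot turn back into markup.
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


dirty = '<div onclick="evil()"><p>Hello <strong>world</strong><br/></p><script>alert(1)</script></div>'
print(sanitize(dirty))
```

Tags outside the allowlist (such as div) are removed while their text is kept; script and style are removed together with their contents. For real user input, a maintained sanitizing library is likely the safer choice.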
2

For the first one you can use split then a list comprehension to trim the extra whitespace:

result = [x.strip() for x in i.split(',')]

And to remove the empty strings from the list:

result = [x for x in result if x]

3 Comments

It would have to be result = [x.strip() for x in i.split(',') if x.strip()]; I was hoping there would be a more efficient way of doing this, though. Thanks anyway.
btw [x.strip() for x in i.split(',') if x.strip()] does both at the same time :)
@RadiantHex: ... by performing the strip twice. This answer would be better if the first operation were a generator, not a list comprehension.
1

I tend to write multiple cascading generators, particularly if I want some output to be part of a test:

stripped_iter = (x.strip() for x in l.split(','))
non_empty_iter = (x for x in stripped_iter if x)

The inspiration is Beazley's presentation on coroutines.
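As a runnable sketch of the cascading-generator idea applied to the input from the question: each piece is stripped exactly once, and the empty pieces are filtered out afterwards. The helper name `split_clean` and its `sep` parameter are hypothetical, chosen only for this illustration.

```python
def split_clean(text, sep=","):
    """Split text on sep, strip each piece once, and drop empty pieces."""
    # Lazy generator: no intermediate list, and strip() runs once per piece.
    stripped = (piece.strip() for piece in text.split(sep))
    return [piece for piece in stripped if piece]


raw = 'item1,   item2, \t\t\t item3, \n\n\n \t, item4, , , item5, '
print(split_clean(raw))  # ['item1', 'item2', 'item3', 'item4', 'item5']
```

Whether this is measurably faster than calling strip() twice depends on the input; for short strings the difference is probably negligible, so readability may be the better reason to prefer it.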


1

I am somewhat of a beginner at Python web development, but for cleaning/sanitizing HTML I have found that the markdown2 library has some very nice features. You can use it with the MarkItUp! jQuery-based editor. They may not solve all your problems, but they might help you do a lot of work in a short time.


1

1) You can use the strip method.

2) You can use Sanitize: http://wonko.com/post/sanitize

3) Some Unicode tips here: http://blog.trydionel.com/2010/03/23/some-unicode-tips-for-ruby/

1 Comment

Erm... the question appears to be Python, rather than Ruby? The way the two languages handle Unicode is very, very different.
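On the Python side, one common pattern for question 3 is to decode bytes to text as soon as they are read, work with text internally, and normalize so that equivalent characters compare equal. The sketch below assumes Python 3 and UTF-8 input, neither of which is stated in the question; the helper name `to_text` is hypothetical.

```python
import unicodedata


def to_text(raw, encoding="utf-8"):
    """Decode bytes to str if needed, then normalize to NFC."""
    if isinstance(raw, bytes):
        # errors="replace" substitutes U+FFFD for undecodable bytes
        # instead of raising UnicodeDecodeError.
        raw = raw.decode(encoding, errors="replace")
    return unicodedata.normalize("NFC", raw)


composed = to_text("caf\u00e9".encode("utf-8"))  # precomposed e-acute, from bytes
decomposed = to_text("cafe\u0301")               # "e" followed by a combining accent
print(composed == decomposed)                    # True after normalization
```

If the source encoding is unknown, it has to be detected or declared before decoding; this sketch does not attempt that.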
1

1) [j.strip() for j in a.split(',') if j.strip()]

2) Check out tidy.
