I want to handle strings of up to 100K characters and write them to a CSV file, split across several columns. (Basically, I am trying to work around Excel's 32K-character cell limit.)

Below is sample code:

soup = BeautifulSoup(r.content, 'html5lib')
html = str(soup.select('div.DocumentText'))
if len(html) > 32000:
    # Question: how do I split html here into separate variables (html_1, html_2, ...)?
    x.writerow([html_1, ..., html_5])
  

The flow I am trying to achieve:

  • Scrape the website
  • Check whether the scraped text is longer than 32,000 characters (and shorter than 100K)
  • If so, split the scraped text into several variables
  • Write each variable into a different column of the CSV file
  • Do you mean that you want to split c.case_html into items of 32K characters each? Commented Sep 7, 2017 at 6:37
  • You should post an example of an HTML input and the corresponding CSV output you want to get. Commented Sep 7, 2017 at 6:41
  • Does the split need to occur on a word boundary? Commented Sep 7, 2017 at 7:39
  • Yes, on a word boundary. Commented Sep 7, 2017 at 8:02
  • I'd use the stripped_strings generator instead of the str function if you're interested in words. crummy.com/software/BeautifulSoup/bs4/doc/… Commented Sep 7, 2017 at 8:22

2 Answers

Maybe you want to try this. It splits the string into chunks of 32,000 characters (change the size if you need to) and puts them into a list. Note that it cuts at fixed positions, not at word boundaries.

if len(html) > 32000:
    # Slice html into consecutive 32000-character chunks, one per CSV column.
    output = [html[i:i + 32000] for i in range(0, len(html), 32000)]
    x.writerow(output)
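
Since the asker confirmed in the comments that the split should fall on a word boundary, the fixed-size slicing can be adapted along these lines. This is only a sketch: the function name split_on_word_boundary and its limit parameter are made up for illustration, and it assumes that breaking at a plain space character is good enough for the scraped text.

```python
def split_on_word_boundary(text, limit=32000):
    """Split text into chunks of at most `limit` characters.

    Breaks at the last space inside each window when there is one,
    and falls back to a hard cut when a window contains no space.
    """
    chunks = []
    start = 0
    while len(text) - start > limit:
        # Last space in text[start:start + limit + 1], or -1 if there is none.
        cut = text.rfind(" ", start, start + limit + 1)
        if cut <= start:
            # No usable space in this window: cut exactly at the limit.
            cut = start + limit
        chunks.append(text[start:cut])
        start = cut
    # Whatever remains already fits within the limit.
    chunks.append(text[start:])
    return chunks


# Small demonstration with a tiny limit so the behavior is easy to see.
for piece in split_on_word_boundary("aaa bbb ccc", limit=5):
    print(repr(piece))
```

The space is kept at the start of the following chunk, so joining the chunks back together reproduces the original string exactly.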

Hope this helps someone. If there is a better way, I'm happy to hear it. The limitation is that it can only handle a case_html string of up to 98K characters.

def strhandler(case_html, length):
    # Yield consecutive slices of case_html, each at most `length` characters long.
    return (case_html[i:i + length] for i in range(0, len(case_html), length))

case_html = str(soup.find('div', class_='DocumentText').find_all(['p', 'center', 'small']))
char_count = len(case_html)
# Ceiling division by 5, so the string never yields more than five pieces.
split_no = max(1, -(-char_count // 5))
print('Characters per column:', split_no)
parts = list(strhandler(case_html, split_no))
parts += [''] * (5 - len(parts))  # pad so the row always has five columns
csv_writer.writerow(parts)
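
A possible variation keeps the 32,000-character limit fixed and pads the row to a constant number of columns, so short and long documents produce rows of the same width. This is a sketch only: the helper name chunk_row and its limit and columns parameters are invented here, and an in-memory io.StringIO buffer stands in for the real output file.

```python
import csv
import io


def chunk_row(text, limit=32000, columns=5):
    """Return exactly `columns` cells, each holding at most `limit` characters."""
    chunks = [text[i:i + limit] for i in range(0, len(text), limit)]
    if len(chunks) > columns:
        # The text does not fit in the available columns; fail loudly
        # instead of silently dropping data.
        raise ValueError("text needs %d columns, only %d available"
                         % (len(chunks), columns))
    # Pad with empty strings so every row has the same number of columns.
    return chunks + [""] * (columns - len(chunks))


# Stand-in for an open CSV file; with a real file, open it with newline="".
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(chunk_row("x" * 70000))
```

Raising an error for oversized input is one design choice among several; truncating or adding more columns would also work, depending on what the consumer of the CSV expects.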
