I have the Python script below; its purpose is to scrape HTML and save it to CSV in nested table(s) format.
Could anyone help? Thank you very much!
I just tried your code (with a small modification) and it works:
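import pandas as pd

url = 'http://www.aastocks.com/en/cnhk/market/quota-balance/hk-connect'
# read_html returns one DataFrame per <table> found on the page
for i, df in enumerate(pd.read_html(url)):
    # write each table to its own numbered file: output_00.csv, output_01.csv, ...
    filename = '/tmp/output_%02d.csv' % i
    df.to_csv(filename, encoding='utf-8')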
Output:
daniel@synapse:/tmp$ cat output_17.csv
,0,1,2,3,4,5
0,Combined Southbound,,,,,
1,Date,Daily Quota Balance(% of Quota),Money Flow,Buy Trade Value(HKD),Sell Trade Value(HKD),Total Trade Value1 (% of Market Turnover)
2,2018/02/07,20.68B(98.46%),In 322.87M,TBA,TBA,TBA
3,2018/02/06,12.98B(61.81%),In 8.02B,25.70B,18.10B,43.80B(16.98%)
4,2018/02/05,10.76B(51.25%),In 10.24B,19.50B,8.91B,28.41B(16.72%)
5,2018/02/02,15.60B(74.29%),In 5.40B,12.43B,7.05B,19.48B(13.17%)
6,2018/02/01,18.67B(88.90%),In 2.33B,11.60B,9.89B,21.49B(13.90%)
7,2018/01/31,14.89B(70.91%),In 6.11B,14.29B,8.32B,22.61B(12.79%)
8,2018/01/30,17.55B(83.55%),In 3.45B,11.82B,8.86B,20.68B(11.92%)
9,2018/01/29,16.24B(77.35%),In 4.76B,14.98B,10.45B,25.43B(13.27%)
10,2018/01/26,17.53B(83.46%),In 3.47B,12.01B,9.20B,21.21B(11.79%)
11,2018/01/25,18.18B(86.58%),In 2.82B,13.09B,11.13B,24.22B(12.90%)
12,2018/01/24,17.02B(81.07%),In 3.98B,13.42B,10.13B,23.55B(12.54%)
13,2018/01/23,14.72B(70.07%),In 6.28B,15.35B,9.39B,24.74B(12.50%)
14,2018/01/22,14.77B(70.31%),In 6.23B,14.43B,8.21B,22.64B(13.40%)
15,2018/01/19,14.75B(70.26%),In 6.25B,13.86B,7.68B,21.54B(13.25%)
16,2018/01/18,14.78B(70.36%),In 6.22B,15.07B,8.94B,24.01B(12.23%)
17,,,,,,
FYI: page charset=UTF-8
I opened the CSV in Excel and found that the contents are garbled.
I didn't find any Chinese characters on that page, but if I change the URL to url = 'http://www.aastocks.com/sc/cnhk/market/quota-balance/hk-connect', the Chinese is shown properly:
import pandas as pd

url = 'http://www.aastocks.com/sc/cnhk/market/quota-balance/hk-connect'
# decode the page explicitly as UTF-8 before parsing its tables
for i, df in enumerate(pd.read_html(url, encoding='utf-8')):
    filename = '/tmp/output_%02d.csv' % i
    df.to_csv(filename, encoding='utf-8')
Check output:
daniel@synapse:/tmp$ head -5 output_17.csv
,0,1,2,3,4,5
0,南向合计,,,,,
1,日期,每日额度馀额(占额度),当日资金流向,买入成交额(港元),卖出成交额(港元),总成交额1(佔大市成交%)
2,2018/02/07,218.69亿(104.14%),流出8.69亿,163.10亿,188.78亿,351.88亿(16.12%)
3,2018/02/06,129.79亿(61.81%),流入80.21亿,257.05亿,180.96亿,438.01亿(16.98%)
So I think the problem is with the encoding in Excel (correct me if I'm wrong).
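If Excel really is the culprit, one thing that may help is writing the file with a UTF-8 byte order mark, since Excel tends to guess a legacy code page for CSV files that lack one. Below is a minimal sketch, assuming Python 3 and using only the standard library; the file name output_bom_demo.csv and the sample rows are made up for illustration.

```python
import csv
import os
import tempfile

# Sample rows (illustrative only), mimicking the Chinese headers above.
rows = [
    ["日期", "当日资金流向"],
    ["2018/02/06", "流入80.21亿"],
]

# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.gettempdir(), "output_bom_demo.csv")

# The "utf-8-sig" codec writes the 3-byte BOM (EF BB BF) before the data.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

# Read the raw bytes back to confirm the BOM is really there.
with open(path, "rb") as f:
    raw = f.read()

has_bom = raw.startswith(b"\xef\xbb\xbf")
print(has_bom)
```

pandas accepts the same codec name, so df.to_csv(filename, encoding='utf-8-sig') should have the same effect on the files written in the loop above. I have not checked this against every Excel version.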
With 'C:\Users\Lawrence\Desktop\PyTest\output.csv' % i, what is there to format? Where is the format flag? Look at my implementation, '/tmp/output_%02d.csv' % i: do you see the %02d flag? Where is yours?
pandas fails to parse the page properly. pandas requires the data to be in HTML tables on a static page, i.e. not populated by JavaScript. In your case the table is populated by JS.
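To illustrate the point about the missing format flag: the % operator needs a placeholder such as %02d in the string, otherwise Python has nowhere to put i and raises an error. A small sketch, where numbered_name is a hypothetical helper used only for this demonstration:

```python
def numbered_name(template, index):
    """Fill a %-style template with the table index (hypothetical helper)."""
    return template % index


# With a %02d placeholder the index is zero-padded to two digits.
good = numbered_name('/tmp/output_%02d.csv', 3)
print(good)

# Without any placeholder the % operator has nothing to substitute into.
try:
    numbered_name('/tmp/output.csv', 3)
    error = None
except TypeError as exc:
    error = str(exc)
print(error)
```

The first call yields /tmp/output_03.csv; the second fails with a TypeError because the template has no placeholder for the index.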
url = u'http://www.aastocks.com/en/cnhk/market/quota-balance/hk-connect'
Are you on python2 or python3? Probably python2, and you have run into an encoding problem.