
I have the code below, which reads a number of ticker symbols from a CSV file into a dataframe.
For each ticker it calls the web API, returning a dataframe df that is then appended to the previous one until all tickers are done.
The code works, but with a large number of tickers it slows down tremendously.
I understand I can use multiprocessing and threads to speed up my code, but I don't know where to start and what would be most suitable in my particular case.

What code should I use to get my data into a combined dataframe in the fastest possible manner?

import pandas as pd
import numpy as np
import json

# read the ticker symbols from the csv file
tickers = pd.read_csv("D:/verhuizen/pensioen/MULTI.csv", names=['symbol', 'company'])

# one call for a known symbol to get the column layout, then start from an empty dataframe
read_str = 'https://financialmodelingprep.com/api/v3/income-statement/AAPL?limit=120&apikey=demo'
df = pd.read_json(read_str)
df = pd.DataFrame(columns=df.columns)

# call the web API for each ticker and append the result to the combined dataframe
for ind in range(len(tickers)):
    read_str = 'https://financialmodelingprep.com/api/v3/income-statement/' + tickers['symbol'][ind] + '?limit=120&apikey=demo'
    df1 = pd.read_json(read_str)
    df = pd.concat([df, df1], ignore_index=True)

df.set_index(['date', 'symbol'], inplace=True)
df.sort_index(inplace=True)

df.to_csv('D:/verhuizen/pensioen/MULTI_out.csv')


The code provided works fine for smaller data sets, but when I use a large number of tickers (>4,000), at some point I get the error below. Is this because the web API gets overloaded, or is there another problem?

Traceback (most recent call last):
  File "D:/Verhuizen/Pensioen/Equity_Extractor_2021.py", line 43, in <module>
    data = pool.starmap(download_data, enumerate(TICKERS, start=1))
  File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 276, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 657, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x00C33E30>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'

Process finished with exit code 1


It keeps giving the same error (for a larger number of tickers); the code is exactly as provided:

def download_data(pool_id, symbols):
    df = []
    for symbol in symbols:
        print("[{:02}]: {}".format(pool_id, symbol))
        #do stuff here
        read_str = BASEURL.format(symbol)
        df.append(pd.read_json(read_str))
        #df.append(pd.read_json(fake_data(symbol)))
    return pd.concat(df, ignore_index=True)

It failed again with pool.map, but I noticed one strange thing: each time it fails, it does so around 12,500 tickers (the total is around 23,000 tickers). Similar error:

Traceback (most recent call last):
  File "C:/Users/MLUY/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/Equity_naive.py", line 21, in <module>
    data = pool.map(download_data, TICKERS)
  File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 657, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x078D1BF0>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'

Process finished with exit code 1


I also get the tickers from an API call, https://financialmodelingprep.com/api/v3/financial-statement-symbol-lists?apikey=demo (I noticed it does not work without a subscription). I wanted to attach the data as a csv file, but I don't have sufficient rights, and I don't think it's a good idea to paste the returned data here...


I tried adding time.sleep(0.2) before the return as suggested, but again I get the same error, at ticker 12,510. Strange that every time it is around the same location. As there are multiple processes going on, I cannot see at what point it is breaking.

Traceback (most recent call last):
  File "C:/Users/MLUY/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/Equity_naive.py", line 24, in <module>
    data = pool.map(download_data, TICKERS)
  File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 657, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x00F32C90>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'

Process finished with exit code 1


Something very strange is going on. I have split the data into chunks of 10,000 / 5,000 / 4,000 and 2,000, and each time the code breaks approximately 100 tickers from the end. Clearly something is going on that is not right.

import time
import pandas as pd
import multiprocessing

# get tickers from your csv
df = pd.read_csv('D:/Verhuizen/Pensioen/All_Symbols.csv', header=None)

# setting the Dataframe to a list (in total 23,000 tickers)
df = df[0]
TICKERS = df.tolist()

# Select how many tickers I want
TICKERS = TICKERS[0:2000]

BASEURL = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"

def download_data(symbol):
    print(symbol)
    # do stuff here
    read_str = BASEURL.format(symbol)
    df = pd.read_json(read_str)
    #time.sleep(0.2)
    return df

if __name__ == "__main__":
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        data = pool.map(download_data, TICKERS)
        df = pd.concat(data).set_index(["date", "symbol"]).sort_index()
    df.to_csv('D:/verhuizen/pensioen/Income_2000.csv')

In this particular example, the code breaks at position 1,903:


RPAI
Traceback (most recent call last):
  File "C:/Users/MLUY/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/Equity_testing.py", line 27, in <module>
    data = pool.map(download_data, TICKERS)
  File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 657, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x0793EAF0>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'

13 Comments
  • Can you copy/paste the content of your download_data function, please? Commented Mar 19, 2021 at 7:07
  • It keeps giving the same error, code is exactly as provided: def download_data(pool_id, symbols): df = [] for symbol in symbols: print("[{:02}]: {}".format(pool_id, symbol)) #do stuff here read_str = BASEURL.format(symbol) df.append(pd.read_json(read_str)) #df.append(pd.read_json(fake_data(symbol))) return pd.concat(df, ignore_index=True) Commented Mar 19, 2021 at 8:03
  • Use the second, simpler version with pool.map instead of the first one with pool.starmap. We will get there! I think you have a rate limitation from your API. Commented Mar 19, 2021 at 8:29
  • It failed again with pool.map, but one strange thing I noticed: each time it fails, it does so around 12,500 tickers (the total is around 23,000 tickers). I get the tickers from an API call; I will attach them as a csv file. Commented Mar 19, 2021 at 9:18
  • API call limits depend on the subscription, except for the Enterprise plan. If each request takes 100 ms, you can do 600 calls per minute. However, you must not exceed 300 calls on the Start plan, while the Professional one allows more (< 750 calls). You can introduce a delay with time.sleep(0.2) before the return. Commented Mar 19, 2021 at 14:12
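
Picking up the rate-limit arithmetic in the last comment, here is a minimal, hedged sketch of per-worker throttling; the delay value and the 300-calls-per-minute figure are assumptions, not confirmed limits, and download_data_throttled is a hypothetical variant of the worker used later in the answer. With a pool of N workers the effective rate is roughly N calls per (request time + delay), so the delay has to be scaled to the pool size:

import time
import pandas as pd

BASEURL = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"
DELAY_PER_CALL = 0.8  # assumed value: with 4 workers this caps the pool at ~4 / 0.8 = 5 calls/s, i.e. ~300/min

def download_data_throttled(symbol):
    # fetch one symbol, then pause so the pool as a whole stays under the API rate limit
    df = pd.read_json(BASEURL.format(symbol))
    time.sleep(DELAY_PER_CALL)  # crude throttle; tune to your plan's calls-per-minute cap
    return df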

1 Answer


The first optimization is to avoid concatenating your dataframe at each iteration.
You can try something like this:

url = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"
df = []

for symbol in tickers["symbol"]:
    read_str = url.format(symbol)
    df.append(pd.read_json(read_str))

df = pd.concat(df, ignore_index=True)

If that's not sufficient, we will look at async, threading or multiprocessing.
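
Since the downloads are network-bound rather than CPU-bound, a thread pool is often the simplest of those three options. Below is a minimal sketch using concurrent.futures.ThreadPoolExecutor (an alternative to the multiprocessing code further down, not taken from it), assuming the same tickers dataframe and URL as above; the worker count of 8 is an arbitrary starting point:

from concurrent.futures import ThreadPoolExecutor

import pandas as pd

url = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"

def fetch(symbol):
    # each thread downloads one symbol; the threads spend most of their time waiting on the network
    return pd.read_json(url.format(symbol))

# raise or lower max_workers to respect the API rate limit
with ThreadPoolExecutor(max_workers=8) as executor:
    frames = list(executor.map(fetch, tickers["symbol"]))

df = pd.concat(frames, ignore_index=True)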

Edit:
The code below can do the job:

import pandas as pd
import numpy as np
import multiprocessing
import time
import random

PROCESSES = 4  # number of parallel process
CHUNKS = 6  # one process handle n symbols

# get tickers from your csv
TICKERS = ["BCDA", "WBAI", "NM", "ZKIN", "TNXP", "FLY", "MYSZ", "GASX", "SAVA", "GCE",
           "XNET", "SRAX", "SINO", "LPCN", "XYF", "SNSS", "DRAD", "WLFC", "OILD", "JFIN",
           "TAOP", "PIC", "DIVC", "MKGI", "CCNC", "AEI", "ZCMD", "YVR", "OCG", "IMTE",
           "AZRX", "LIZI", "ORSN", "ASPU", "SHLL", "INOD", "NEXI", "INR", "SLN", "RHE-PA",
           "MAX", "ARRY", "BDGE", "TOTA", "PFMT", "AMRH", "IDN", "OIS", "RMG", "IMV",
           "CHFS", "SUMR", "NRG", "ULBR", "SJI", "HOML", "AMJL", "RUBY", "KBLMU", "ELP"]

# create a list of n sublist
TICKERS = [TICKERS[i:i + CHUNKS] for i in range(0, len(TICKERS), CHUNKS)]

BASEURL = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"


def fake_data(symbol):
    dti = pd.date_range("1985", "2020", freq="Y")
    df =  pd.DataFrame({"date": dti, "symbol": symbol,
                        "A": np.random.randint(0, 100, size=len(dti)),
                        "B": np.random.randint(0, 100, size=len(dti))})
    time.sleep(random.random())  # to simulate network delay
    return df.to_json()


def download_data(pool_id, symbols):
    df = []
    for symbol in symbols:
        print("[{:02}]: {}".format(pool_id, symbol))
        # do stuff here
        # read_str = BASEURL.format(symbol)
        # df.append(pd.read_json(read_str))
        df.append(pd.read_json(fake_data(symbol)))
    return pd.concat(df, ignore_index=True)


if __name__ == "__main__":
    with multiprocessing.Pool(PROCESSES) as pool:
        data = pool.starmap(download_data, enumerate(TICKERS, start=1))
        df = pd.concat(data).set_index(["date", "symbol"]).sort_index()

In this example, I split the list of tickers into sublists so that each process retrieves data for several symbols, which limits the overhead of creating and destroying processes.

The delay is there to simulate the response time of the network connection and to highlight the multiprocessing behaviour.
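
As a small illustration (with made-up symbols) of what the CHUNKS list comprehension produces, each inner list being what one download_data call receives:

symbols = ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF", "GGG"]  # hypothetical tickers
CHUNKS = 3
sublists = [symbols[i:i + CHUNKS] for i in range(0, len(symbols), CHUNKS)]
print(sublists)  # [['AAA', 'BBB', 'CCC'], ['DDD', 'EEE', 'FFF'], ['GGG']]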

Edit 2: a simpler but naive version for your needs

import pandas as pd
import multiprocessing

# get tickers from your csv
TICKERS = ["BCDA", "WBAI", "NM", "ZKIN", "TNXP", "FLY", "MYSZ", "GASX", "SAVA", "GCE",
           "XNET", "SRAX", "SINO", "LPCN", "XYF", "SNSS", "DRAD", "WLFC", "OILD", "JFIN",
           "TAOP", "PIC", "DIVC", "MKGI", "CCNC", "AEI", "ZCMD", "YVR", "OCG", "IMTE",
           "AZRX", "LIZI", "ORSN", "ASPU", "SHLL", "INOD", "NEXI", "INR", "SLN", "RHE-PA",
           "MAX", "ARRY", "BDGE", "TOTA", "PFMT", "AMRH", "IDN", "OIS", "RMG", "IMV",
           "CHFS", "SUMR", "NRG", "ULBR", "SJI", "HOML", "AMJL", "RUBY", "KBLMU", "ELP"]

BASEURL = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"


def download_data(symbol):
    print(symbol)
    # do stuff here
    read_str = BASEURL.format(symbol)
    df = pd.read_json(read_str)
    return df


if __name__ == "__main__":
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        data = pool.map(download_data, TICKERS)
        df = pd.concat(data).set_index(["date", "symbol"]).sort_index()

Note about pool.map: each symbol in TICKERS is handed to one of the pool's worker processes, which calls download_data on it; the pool reuses its workers rather than creating a new process per symbol.
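
One plausible reading of the recurring MaybeEncodingError in the question is that a single request fails inside a worker: the raised exception (for example urllib.error.HTTPError) holds an open response object that cannot be pickled back to the parent process, which produces the "cannot serialize '_io.BufferedReader' object" message. A minimal sketch of a more defensive download_data, assuming it is acceptable to skip and log the symbols that fail:

import pandas as pd
from urllib.error import HTTPError, URLError

BASEURL = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"

def download_data(symbol):
    # on failure, return None instead of letting the exception propagate,
    # since an HTTPError cannot be pickled back to the parent process
    try:
        return pd.read_json(BASEURL.format(symbol))
    except (HTTPError, URLError, ValueError) as exc:
        print("failed: {} ({})".format(symbol, exc))
        return None

# in the __main__ block, drop the failed downloads before concatenating:
# data = [d for d in pool.map(download_data, TICKERS) if d is not None]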


6 Comments

Works (and I often work like this), but I would not call the list df but ldf (for "list of dataframes") to avoid confusion.
You use two variables, ldf and df, so you consume twice as much memory. df as a list is only used in the loop, and df as a dataframe can be used anywhere later in the code, so I prefer to overwrite the temporary list with the final dataframe.
Thanks, there is about a 15% speed increase, but I was hoping for a several-fold improvement. How would I use multiprocessing, and what speed increase could I expect there?
I have around 10,000 symbols and each symbol needs three separate web calls to get a JSON with the three financial statements. But looking at your code, wow, I think I am in way over my head... I think I understand why you are simulating the response time, but I really don't know how to implement this now. I see you commented out the code that calls my real data, but how do I actually call the download_data function? I suppose the pool_id is one of the 4 processes? If I get this to work, it is a black box I really need to study!
Wow, amazingly fast! It's working, thanks so much. Initially I tried to run it in a Jupyter notebook but somehow it did not work, but when I ran it inside the PyCharm IDE it worked fine. I don't understand most of the code, but I am getting results. I tried changing the processes (set at 4) and chunks (set at 6), but for each change (upwards) it gives an error. Is there a maximum to these sizes, and is there still a way to optimise the speed by adjusting these, or is this the top speed now?
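
Regarding the last comment on tuning PROCESSES and CHUNKS: for network-bound work like this, the API's rate limit rather than the CPU count usually determines the useful pool size, so the practical way to pick values is to measure. A small hedged timing harness for comparing settings empirically; benchmark is a hypothetical helper and it assumes download_data and TICKERS as defined in the simpler version above:

import time
import multiprocessing

def benchmark(pool_size, symbols):
    # time pool.map over download_data for a given pool size
    start = time.perf_counter()
    with multiprocessing.Pool(pool_size) as pool:
        pool.map(download_data, symbols)
    return time.perf_counter() - start

if __name__ == "__main__":
    sample = TICKERS[:60]  # a small sample is enough to compare settings
    for size in (2, 4, 8):
        print(size, "workers:", round(benchmark(size, sample), 1), "s")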