I am building a Python service to compute option Greeks, but I have never worked with Python before, and I would appreciate advice from experienced Python devs, as it would help me a lot.

FLOW
The service will later integrate with an existing Spring Boot backend.
Currently, I upload an Excel/CSV file (up to 700k rows) from a UI, which contains option data for which I need to calculate Greeks.

I’m using:

  • FastAPI → async API server (for streaming response)

  • Pandas → data manipulation, reading Excel/CSV

  • NumPy → vectorized math

  • SciPy → Black-Scholes & Greeks computations

  • orjson → fast JSON serialization

  • ProcessPoolExecutor → for parallel chunk-based processing

Processing pipeline:

1. File reading (main process) – pandas for CSV (C engine), openpyxl for Excel
2. Split into chunks – about 20,000 rows per chunk
3. Parallel computation – ProcessPoolExecutor
4. Vectorized Black-Scholes calculations using NumPy
5. Error checks – NaN, negatives, type mismatches
6. Convert results to dicts and calculate aggregates
7. Merge results – combine all chunk outputs and totals
8. Serialize & stream – orjson and StreamingResponse
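The vectorized Black-Scholes step can be sketched as follows (a minimal sketch for European call Greeks; the function and variable names are mine, not from the actual service):

```python
import numpy as np
from scipy.stats import norm

def bs_greeks(S, K, T, r, sigma):
    """Vectorized delta/gamma/vega for European calls.

    All inputs are NumPy arrays of equal length; no Python-level loop,
    so 700k rows are computed in a handful of array operations.
    """
    sqrt_T = np.sqrt(T)
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt_T)
    delta = norm.cdf(d1)                         # call delta, N(d1)
    gamma = norm.pdf(d1) / (S * sigma * sqrt_T)  # same for calls and puts
    vega = S * norm.pdf(d1) * sqrt_T             # per 1.0 change in vol
    return delta, gamma, vega
```

Each worker process can run this on its own chunk, so the per-chunk work stays pure NumPy.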

Below is my performance chart; the response time for 700k records through Excel is 9-11 seconds right now.

### 700k Rows

| Configuration        | Read File | Calculate | Build Results | JSON | Total  |
|---------------------|-----------|-----------|---------------|------|--------|
| **Single Process**  | 1-2s      | 5-6s      | 8-10s        | 3-4s | 17-22s |
| **4 Workers**       | 1-2s      | 3-4s*     | 3-4s*        | 3-4s | 10-14s |
| **8 Workers**       | 1-2s      | 2-3s*     | 2-3s*        | 3-4s | 8-12s  |

*Parallel processing time (multiple chunks at once)

### 60k Rows

| Configuration       | Total Time | Notes                   |
|---------------------|------------|-------------------------|
| **Single Process**  | 2-3s       | No overhead, pure speed |
| **4 Workers**       | 3-4s       | ⚠️ Overhead > benefit   |
| **8 Workers**       | 4-5s       | ⚠️ Too much overhead    |

Questions (sorry if these sound stupid, but I want to build production-grade applications and learn best practices):

Is it a good idea to use worker processes in this API? They take a decent amount of memory and might affect the server. Do people use them in production, and what should I keep in mind?

Is my tech stack (FastAPI + Pandas + NumPy + SciPy + orjson) appropriate for this type of workload, or should I consider something else (e.g., Polars, Cython, or PyPy)?

Apart from JSON serialization overhead, are there other bottlenecks I should be aware of (e.g., inter-process communication, the GIL, or I/O blocking)?

Any help would be appreciated

5 Replies

Please elaborate on the way you use FastAPI's workers and your implementation of the API as a whole. If uploading a CSV, preprocessing, and all the math are done within one route and one request, you won't really gain any performance by increasing the number of FastAPI workers.

Your work is mostly CPU-bound, so it makes sense to separate the networking and the math into different entities. The way I usually do this in my projects: have an API/Producer (FastAPI) that processes incoming requests; if a request is composite, it splits it into separate tasks and passes them to workers to process. Workers are replicated and run in parallel, each processing its own part of the workload. After completing the work, results are passed back to the Producer for a response.
More technically speaking, your Producer is FastAPI; for workers I usually go with Celery, which is a popular and solid choice, but there are many others. You'll also need a way for the Producer and Workers to communicate – Redis is a good choice.

Adding to that - I'd suggest ditching Pandas and going with Polars, in my experience performance gain is really noticeable. So your workflow will go like that: upload a csv -> split it in chunks -> assign a separate task to process each chunk and execute them in parallel -> gather results and return a response

Thank you for your insights. The idea behind the workers is parallel processing: for example, for 700k records, the service makes chunks and assigns them to workers for the calculations, then merges the results and sends them back to the client as JSON. My API structure: a request comes in to the API controller, the Excel file is parsed, control passes to the service layer, the data is split into chunks, the chunks are distributed to workers for the math/calculations, we merge the calculations back, and a streaming response is sent to the client.

My current approach is:

  • Parallel path uses Python’s ProcessPoolExecutor inside the request lifecycle to fan out chunks, then merge and return the response. No Celery/Redis; the HTTP handler orchestrates, workers are OS processes, results are returned in-memory and merged, then sent back.

I will work on ditching pandas; my main concern was the impact of the workers when the project is deployed in prod. Tbh I am not familiar with using Redis and producers, but after your comment I read up on it, and it may be a better approach in this case, so I am going to explore it more. My focus is to reduce the API response time without adding complex processes, as I would be the one debugging them, heh.
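The in-request fan-out described above can be sketched like this (a minimal sketch with a placeholder computation; the names are illustrative):

```python
from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk):
    # Placeholder for the per-chunk Greeks math. Must be a module-level
    # (picklable) function, since it is shipped to worker processes.
    return sum(chunk)

def run_parallel(chunks, max_workers=4):
    # Fan chunks out to OS processes, collect results in submit order,
    # then merge in-memory in the request handler.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_chunk, chunks))
```

One caveat: every chunk's inputs and outputs are pickled across the process boundary, which is exactly the inter-process communication overhead asked about in the question.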

Original DataFrame (700,000 rows)
│
├─ Chunk 1  [0     - 19,999]   → 20,000 rows
├─ Chunk 2  [20,000 - 39,999]   → 20,000 rows
├─ Chunk 3  [40,000 - 59,999]   → 20,000 rows
├─ ...
└─ Chunk 35 [680,000 - 699,999] → 20,000 rows
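The split shown above reduces to simple range arithmetic; a minimal sketch using half-open (start, stop) ranges:

```python
def chunk_ranges(n_rows, chunk_size=20_000):
    """Yield half-open (start, stop) row ranges covering n_rows."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

ranges = list(chunk_ranges(700_000))
# 35 ranges: (0, 20000) ... (680000, 700000)
```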

+1 for ditching pandas in favor of polars.

In addition, ditch openpyxl in favor of fastexcel which polars uses as its default Excel reader anyway.

Don't parallelize manually inside the route. You're not even getting half the speed when you're 4xing the workers. Polars will be so much faster that you won't need to. Make sure you don't use map_elements; that will be slow.

Use polars's df.write_json to go straight to json instead of going through dicts and orjson.

I probably wouldn't bother with StreamingResponse, because the hassle of doing it right probably isn't worth it. Doing it right means having the results returned via a generator. Also, you'd probably want to return CSV or NDJSON, since JSON (generally speaking) can't be parsed until the entire thing arrives, which negates much of the benefit of streaming.

Use a middleware to return gzipped response unless this API is in an intranet where bandwidth is effectively unlimited.

Thanks a lot, dean and Ollie, I am able to achieve 3-4 secs for 700k records. I wouldn't have been able to do it without your guidance. I will work on optimising it more. Thanks a tonne once again.

openpyxl was the main culprit slowing down the process a lot; replacing it with fastexcel was very effective, and ditching pandas was absolutely worth it.
