Passing a pandas dataframe to FastAPI for NLP ML

Question

I am trying to deploy an NLP ML model for the first time, and it was suggested that I use FastAPI and uvicorn. I have had some success getting FastAPI to respond; however, I have not been able to pass a dataframe and have the endpoint process it. I've tried using dictionaries and also tried converting the passed JSON to a dataframe.

With data_dict = data.dict() I get: ValueError: Iterable over raw text documents expected, string object received.

With data_dict = pd.DataFrame(data.dict()) I get: ValueError: If using all scalar values, you must pass an index

I believe I understand the problem: my Data class declares messages as a single str, whereas fit_transform() expects an iterable of raw text documents. However, I have not been able to determine how to define and/or pass the data so that fit_transform() will work. Ultimately I want a prediction returned based on the submitted messages value. Bonus if I can pass a dataframe of one or more rows and have a prediction made and returned for each row. The response should include the id, the project, and the prediction, so that we can later use it to post the prediction back to the original (requesting) system.
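Both errors can be reproduced outside FastAPI. The following is a minimal sketch, assuming scikit-learn and pandas are installed; the variable names are illustrative and a default TfidfVectorizer() stands in for the configured one in main.py. It suggests that in both cases the fix is to wrap the single value in a list.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

message = "This is message 1"  # what data.messages holds: one plain string

# 1) The vectorizer rejects a bare string; it wants an iterable of documents.
try:
    TfidfVectorizer().fit_transform(message)
    vectorizer_error = None
except ValueError as exc:
    vectorizer_error = str(exc)

# 2) A dict of scalars gives pandas no way to infer the number of rows.
try:
    pd.DataFrame({"project": "project1", "messages": message})
    frame_error = None
except ValueError as exc:
    frame_error = str(exc)

# Wrapping each value in a list makes it a collection with one element.
matrix = TfidfVectorizer().fit_transform([message])
frame = pd.DataFrame({"project": ["project1"], "messages": [message]})

print(vectorizer_error)
print(frame_error)
print(matrix.shape[0], frame.shape)
```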

test_connection.py

#%%
import requests
import pandas as pd
import json
import os
from pprint import pprint

url = 'http://127.0.0.1:8000/predict'
print(os.getcwd())
#%%
df = pd.DataFrame(
    {
        'id': ['ab410483801c38', 'cd34148639180'],
        'project': ['project1', 'project2'], 
        'messages': ['This is message 1', 'This is message 2']
    }
)
to_predict_dict = df.iloc[0].to_dict()
#%%
r = requests.post(url, json=to_predict_dict)

main.py

#!/usr/bin/env python
# coding: utf-8

import pickle
import pandas as pd
import numpy as np
from pydantic import BaseModel
from sklearn.feature_extraction.text import TfidfVectorizer

# Server
import uvicorn
from fastapi import FastAPI
# Model
import xgboost as xgb


app = FastAPI()

clf = pickle.load(open('data/xgbmodel.pickle', 'rb'))

class Data(BaseModel):
    # id: str
    project: str
    messages: str

@app.get("/ping")
async def test():
    return {"ping": "pong"}

@app.post("/predict")
async def predict(data: Data):
#    data_dict = data.dict()
    data_dict = pd.DataFrame(data.dict())
    tfidf_vect = TfidfVectorizer(stop_words="english", analyzer='word', token_pattern=r'\w{1,}')
    tfidf_vect.fit_transform(data_dict['messages'])
#   to_predict = tfidf_vect.transform(data_dict['messages'])
#   prediction = clf.predict(to_predict)

    return {"response": "Success"}

Can't you do it without a DataFrame in main.py, e.g. fit_transform(data.messages)?
– furas, Commented Jul 31, 2020 at 7:49
No, that's when I get the ValueError about a string being received. I apologize that this wasn't clear in my post, but those errors actually occur at the fit_transform() step.
– Eric, Commented Jul 31, 2020 at 11:54
I'll add that I haven't tried dot notation, only brackets. Not sure there's a difference, but I will give it a try.
– Eric, Commented Jul 31, 2020 at 11:57
Skipping data_dict = data.dict() entirely and simply using data.messages did not work. The issue is my Data class, where I have defined the fields as str, while fit_transform is expecting raw text documents.
– Eric, Commented Jul 31, 2020 at 13:26
My mistake: the name messages was misleading, and I thought it held a list of messages. For a single message (a single string) I would use the name message, without the s.
– furas, Commented Jul 31, 2020 at 13:34

Eric · Accepted Answer · 2020-07-30 23:03:38Z

1

Probably not the most elegant solution but I've made progress using the following:

def predict(data: Data):
    data_dict = pd.DataFrame(
        {
            'id': [data.id],
            'project': [data.project],
            'messages': [data.messages]
        }
    )

answered Jul 30, 2020 at 23:03

Eric


3 Comments

Eric Over a year ago

Uncommenting the remaining code, tfidf_vect, to_predict, prediction, and attempting to return {"Prediction": prediction} results in a dump of data ending in in input data` and an error JSONDecodeError: Expecting value: line 1 column 1 (char 0)
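One possible cause, not confirmed in this thread: clf.predict() typically returns a NumPy array, and NumPy integers are not plain Python types, so encoding the response can fail on the server. The client then receives an error page instead of JSON, which would explain the JSONDecodeError. A minimal sketch, assuming NumPy is installed and using the standard json module as a stand-in for the server's encoder (the prediction array is made up):

```python
import json

import numpy as np

# Made-up stand-in for the array returned by clf.predict(to_predict).
prediction = np.array([1, 0], dtype=np.int64)

# The standard JSON encoder does not know how to handle a NumPy array.
try:
    json.dumps({"prediction": prediction})
    failed = False
except TypeError:
    failed = True

# .tolist() converts the array into plain Python ints, which encode cleanly.
payload = {"prediction": prediction.tolist()}
encoded = json.dumps(payload)

print(failed)
print(encoded)
```

For a single row, prediction[0].item() gives the same kind of plain Python value.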

Kenneth Leung Over a year ago

Wouldn't this solution be difficult to implement if I have many (e.g.40+) columns?

Max Power Over a year ago

@KennethLeung great question but I think this answer is extensible to that case using a dict comprehension. e.g.: data_dict = {c: df[c] for c in df.columns}
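For a model with many fields, another option is to avoid naming the columns at all. This is a sketch, assuming pydantic and pandas are installed; to_frame is a hypothetical helper, and the Data model here includes id as in the answer above. A list containing one dict is read by pandas as a single row, with the columns taken from the field names.

```python
import pandas as pd
from pydantic import BaseModel


class Data(BaseModel):
    id: str
    project: str
    messages: str


def to_frame(data: Data) -> pd.DataFrame:
    """Hypothetical helper: build a one-row DataFrame from any flat model."""
    # dict(data) maps field names to values, like data.dict() for a flat model.
    # Putting that dict in a list tells pandas it is one row, so no index is needed.
    return pd.DataFrame([dict(data)])


frame = to_frame(Data(id="ab410483801c38", project="project1", messages="This is message 1"))
print(frame)
```

The same helper works unchanged if the model grows to 40 or more fields.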

J. Javier Gálvez · Accepted Answer · 2022-02-23 01:29:19Z

1

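First, encode your DataFrame df as record-oriented JSON and send it as the request body:

r = requests.post(url, data=df.to_json(orient='records'), headers={'Content-Type': 'application/json'})

(Passing that string through json= instead would JSON-encode it a second time, so the server would receive one quoted string rather than a list of records.)

Then, decode the data inside the /predict/ endpoint, whose body parameter must accept a list of records (for example data: List[Data]), with:

df = pd.DataFrame(jsonable_encoder(data))

Remember the import: from fastapi.encoders import jsonable_encoder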

edited Feb 23, 2022 at 1:29

answered Feb 22, 2022 at 20:43

J. Javier Gálvez


2 Comments

Community Over a year ago

As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.

Kots Over a year ago

for me df.to_json() did the trick!

poldpold · Accepted Answer · 2022-03-28 14:45:06Z

1

A newer library, pandera, now supports passing DataFrames through FastAPI directly, without manual conversion. The docs are a bit basic as of this posting, but may be worth reading: https://pandera.readthedocs.io/en/latest/fastapi.html#fastapi-integration.

answered Mar 28, 2022 at 14:45

poldpold


Eric · Accepted Answer · 2020-08-03 14:09:34Z

I was able to address the issue by simply wrapping data.messages in a list. I also had to make an unrelated change: I had failed to pickle my vectorizer (string tokenizer), so the code now loads it alongside the model.

import pickle
import pandas as pd
import numpy as np
import json
import time
from pydantic import BaseModel
from sklearn.feature_extraction.text import TfidfVectorizer

# Server / endpoint
import uvicorn
from fastapi import FastAPI
# Model
import xgboost as xgb


app = FastAPI(debug=True)

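# Load the trained model and the fitted vectorizer once at startup;
# the with-blocks close each file as soon as it has been read.
with open('data/xgbmodel.pickle', 'rb') as f:
    clf = pickle.load(f)
with open('data/tfidfvect.pickle', 'rb') as f:
    vect = pickle.load(f)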

class Data(BaseModel):
    id: str = None
    project: str
    messages: str

@app.get("/ping")
async def ping():
    return {"ping": "pong"}

@app.post("/predict/")
def predict(data: Data):
    start = time.time()
    data_l = [data.messages] # make messages iterable.
    to_predict = vect.transform(data_l)
    prediction = clf.predict(to_predict)

    exec_time = round((time.time() - start), 3)
    return {
        "id": data.id,
        "project": data.project,
        "prediction": prediction[0], 
        "execution_time": exec_time
        }

if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000)
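The vectorizer has to be pickled because the model expects the same feature columns it saw during training; fitting a fresh TfidfVectorizer on each request would build a different vocabulary from that one message. A minimal sketch of the idea, assuming scikit-learn is installed; the training corpus is made up, and in-memory bytes stand in for the data/tfidfvect.pickle file:

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training corpus, for illustration only.
train_messages = [
    "This is message 1",
    "This is message 2",
    "Another message entirely",
]

# Training time: fit the vectorizer once, then serialize the fitted object.
vect = TfidfVectorizer()
train_matrix = vect.fit_transform(train_messages)
blob = pickle.dumps(vect)  # stands in for writing data/tfidfvect.pickle

# Serving time: load the fitted vectorizer and only call transform().
loaded = pickle.loads(blob)
new_matrix = loaded.transform(["A brand new message"])

# The new row has the same number of columns as the training matrix.
print(train_matrix.shape[1] == new_matrix.shape[1])
```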
