The following problem arose while implementing a function that loads a Hugging Face machine learning model on Lambda, processes an input, and returns the results.

Problem

The following error is logged in CloudWatch Logs, and I would like to resolve it:

OSError: [Errno 30] Read-only file system: '/home/sbx_user1051'

Current Situation

The source code is posted further down.

The client sends an audio file to Lambda via POST; when it is received, the Lambda function reads the audio file and runs it through a machine learning model.

However, the POST results in an Internal Server Error on the client side, and OSError: [Errno 30] Read-only file system: '/home/sbx_user1051' appears in the Lambda logs.
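For reference, a client request that reproduces the error looks roughly like this (the URL is a placeholder for the actual API Gateway endpoint, and sample.wav is any test audio file):

import requests

# Placeholder URL; substitute the real API Gateway endpoint.
with open("sample.wav", "rb") as f:
    resp = requests.post(
        "https://<aws URI>/diarization",
        files={"file": ("sample.wav", f, "audio/wav")},
    )
print(resp.status_code, resp.text)  # 500 Internal Server Error in my case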

Source Code

The diarization function in the speaker_diarization.py file is where the problem occurs. It happens when the client POSTs to the Lambda's https://<aws URI>/diarization endpoint.

The hosting itself is working, since a route that returns only "Hello World" displays correctly.

speaker_diarization.py

import os
from pyannote.audio import Pipeline
from fastapi import UploadFile
import io
from dotenv import load_dotenv
from pydub import AudioSegment
import numpy as np
import torch


def annotation_to_dict(annotation):
    result_dict = {}
    for segment, _, speaker in annotation.itertracks(yield_label=True):
        start = segment.start
        end = segment.end
        if speaker not in result_dict:
            result_dict[speaker] = []
        result_dict[speaker].append({"start": start, "end": end})
    return result_dict


async def diarization(file: UploadFile):
    """Execute Speaker Diarization"""
    load_dotenv()
    audio_data = await file.read()
    audio_buffer = io.BytesIO(audio_data)
    audio_segment = AudioSegment.from_file(
        audio_buffer, file.content_type.split("/")[-1]
    )
    waveform_data = np.array(audio_segment.get_array_of_samples())
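    # Scale integer samples to [-1, 1); dividing by 1 << 15 assumes 16-bit audio.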
    waveform_tensor = torch.tensor(waveform_data, dtype=torch.float32) / (1 << 15)
    waveform_tensor = waveform_tensor.view(audio_segment.channels, -1)

    pipeline = Pipeline.from_pretrained(
        "pyannote/[email protected]",
        use_auth_token=os.environ["HUGGING_API_KEY"],
        cache_dir="/tmp",
    )
    # Pass the torch.Tensor with waveform data to the pipeline
    annotation = pipeline(
        {"waveform": waveform_tensor, "sample_rate": audio_segment.frame_rate}
    )

    result = annotation_to_dict(annotation)
    return result
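For reference, the dictionary that diarization returns has the following shape (speaker labels and timestamps below are illustrative):

# Illustrative return value of diarization / annotation_to_dict:
{
    "SPEAKER_00": [
        {"start": 0.2, "end": 3.1},
        {"start": 7.4, "end": 9.0},
    ],
    "SPEAKER_01": [
        {"start": 3.3, "end": 7.2},
    ],
}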

app.py

from fastapi import FastAPI, File, UploadFile
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
import uvicorn
import os
from mangum import Mangum

from speaker_diarization import diarization

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_origin_regex="https?://.*",
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
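# Lambda entrypoint: Mangum adapts the FastAPI (ASGI) app to Lambda's handler interface.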
handler = Mangum(app)


@app.post("/diarization")
async def post_diarization(file: UploadFile = File(...)):
    os.environ["TRANSFORMERS_CACHE"] = "/tmp"
    res = await diarization(file)

    print({"text": res})
    return JSONResponse({"text": res})


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=9000)

What I tried

Lambda allows write and delete operations only under /tmp, so file writes must either go to /tmp or be avoided entirely. Accordingly, I removed the step that temporarily saved the uploaded file to the filesystem.

The Hugging Face cache is created when Pyannote loads the model, so I save the cache under /tmp by passing the absolute path /tmp as cache_dir.

Before executing the Pyannote function, I set os.environ["TRANSFORMERS_CACHE"] = "/tmp" so that the cache is saved to /tmp (just in case); a sketch extending this idea follows below.
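As a sketch of this approach taken further, every cache location these libraries use can be pointed at /tmp before any of them is imported; which variable each library honours is my assumption based on their documentation:

import os

# Must run before pyannote/torch/transformers are imported,
# since the libraries read these variables at import time.
os.environ["HF_HOME"] = "/tmp/hf"               # huggingface_hub / transformers cache
os.environ["TRANSFORMERS_CACHE"] = "/tmp/hf"    # older transformers releases
os.environ["TORCH_HOME"] = "/tmp/torch"         # torch.hub checkpoint cache
os.environ["PYANNOTE_CACHE"] = "/tmp/pyannote"  # pyannote.audio pipeline cache
os.environ["MPLCONFIGDIR"] = "/tmp/mpl"         # matplotlib config dir
# Blunt fallback: anything defaulting to ~/.cache then resolves under /tmp.
os.environ["HOME"] = "/tmp"

from pyannote.audio import Pipeline  # import only after the variables are set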

I added ENV PYTHONDONTWRITEBYTECODE=1 to the Dockerfile to prevent __pycache__ files from being generated.

I have taken the measures above, but I cannot think of anything else that might be wrong with them, or of another factor. In particular, I could not determine which library or process starts writing to /home/sbx_user1051 on its own. Therefore, I decided to post this on Stack Overflow.
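One hypothetical way to narrow this down would be a tracing shim like the following, which monkeypatches open() and os.makedirs() to print a stack trace whenever something attempts to write outside /tmp. It will not catch every write path (pathlib and C extensions can bypass builtins.open), but it might reveal the offending library. It would run at the very top of app.py, before any library imports:

import builtins
import os
import traceback

_real_open = builtins.open
_real_makedirs = os.makedirs


def _is_write_mode(mode: str) -> bool:
    # "r"/"rb" are reads; anything containing w, a, x, or + can write.
    return any(c in mode for c in "wax+")


def _traced_open(file, mode="r", *args, **kwargs):
    path = str(file) if not isinstance(file, int) else None
    if path and _is_write_mode(mode) and not path.startswith("/tmp"):
        print(f"WRITE OUTSIDE /tmp: open({path!r}, mode={mode!r})")
        traceback.print_stack()
    return _real_open(file, mode, *args, **kwargs)


def _traced_makedirs(name, *args, **kwargs):
    if not str(name).startswith("/tmp"):
        print(f"MKDIR OUTSIDE /tmp: makedirs({str(name)!r})")
        traceback.print_stack()
    return _real_makedirs(name, *args, **kwargs)


builtins.open = _traced_open
os.makedirs = _traced_makedirs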

The answer in the question linked below suggests copying the library into /tmp, but that is difficult because the library exceeds the 512 MB limit, and AWS clears that folder when a new instance starts: AWS lambda read-only file system error, using docker image to store ML model

Comments
  • If your data needs to persist across Lambda invocations then consider EFS or S3. Note: AWS Lambda now supports up to 10GB ephemeral storage. See ephemeral storage for more. Commented May 13, 2023 at 16:48
  • The docker data size is less than 7GB. I am considering using Amazon SageMaker Serverless as I have not been able to solve the problem. Commented May 23, 2023 at 6:12
