Processing hundreds of CSV files one row at a time for embedding and upload to Pinecone using OpenAI embeddings

Asked 1 year, 9 months ago

Modified 1 year, 9 months ago

Viewed 281 times

This is my current code, which works for a while and then throws a "can't start new thread" error. I tried both threading and multiprocessing, and both eventually cause this error.

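import csv
from multiprocessing import Process

# embedder, pc_store, embeddings and PINECONE_INDEX_NAME are defined elsewhere (not shown).

def process_file(file_path):
    print(f'file: {file_path}')

    def process_row(row):
        # Embed the text column and upload it with the other columns as metadata.
        text = row['text']
        row2data = row['row2data']  # read here but not passed on below
        year = row['year']
        group_id = row['group_id']
        docs = embedder(text, text, year, group_id)
        my_index = pc_store.from_documents(docs, embeddings, index_name=PINECONE_INDEX_NAME)

    # Stream the CSV and handle one row at a time.
    with open(file_path, 'r', newline='') as file:
        reader = csv.DictReader(file)
        for row in reader:
            process_row(row)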

if __name__ == '__main__':
    file_paths = ['file1', 'file2', 'file3']
    processes = []

    for file_path in file_paths:
        p = Process(target=process_file, args=(file_path,))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()

Here is the stack trace of the error:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/pool.py", line 215, in __init__
    self._repopulate_pool()
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/pool.py", line 306, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/pool.py", line 329, in _repopulate_pool_static
    w.start()
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/dummy/__init__.py", line 51, in start
    threading.Thread.start(self)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 971, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

edited Mar 1, 2024 at 14:24

asked Mar 1, 2024 at 4:51

John Taylor

Can you post a minimal process_row that causes this issue, along with a stack trace?

– Booboo
Commented Mar 1, 2024 at 11:07

I edited the original post to include the process_row function. It creates embeddings from the text column of each row and then adds the other items as metadata for a Pinecone vector DB import.

– John Taylor
Commented Mar 1, 2024 at 14:25

Your error does not match your code. The error is from a multiprocessing.dummy thread pool, which does not show up in your code. The "can't start new thread" error is likely due to starting too many concurrent threads (search for ulimit). If each process in a large process pool creates its own thread pool, the number of threads multiplies very quickly (past the point of any likely performance benefit, I might add).

– Aaron
Commented Mar 1, 2024 at 15:11

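To illustrate the point about thread counts, here is a minimal sketch that only counts live threads. It is not the asker's code: both function names are hypothetical, and it assumes the thread pools come from `multiprocessing.pool.ThreadPool`, as the stack trace suggests. It compares opening many pools that are all alive at once with pushing many tasks through a single pool.

```python
import threading
from multiprocessing.pool import ThreadPool


def threads_added_by_unclosed_pools(n_pools, threads_per_pool):
    """Open n_pools thread pools at once and report how many threads that added."""
    before = threading.active_count()
    pools = [ThreadPool(threads_per_pool) for _ in range(n_pools)]
    added = threading.active_count() - before
    # Clean up so the demonstration itself does not leak threads.
    for pool in pools:
        pool.close()
        pool.join()
    return added


def threads_added_by_shared_pool(n_tasks, threads_per_pool):
    """Run n_tasks through one pool and report the extra threads while it is open."""
    before = threading.active_count()
    with ThreadPool(threads_per_pool) as pool:
        pool.map(str, range(n_tasks))
        added = threading.active_count() - before
    return added


if __name__ == "__main__":
    print("ten pools alive at once:", threads_added_by_unclosed_pools(10, 4))
    print("one shared pool, 1000 tasks:", threads_added_by_shared_pool(1000, 4))
```

The first number grows with the number of pools that are alive at the same time, while the second does not depend on the number of tasks. If something in the row-processing path creates a pool per call and it is not released promptly, the count would then be multiplied again by the number of worker processes.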
I think you are right. How do I do this with multiprocessing while limiting it to 10 processes? I basically need my code to run against ten separate files at the same time and process them. That's it.

– John Taylor
Commented Mar 1, 2024 at 15:13

Hard to tell without knowing more about the part of your processing function that creates the thread pool, but I'd start by using a multiprocessing.Pool with 10 processes rather than creating a new process for each file.

– Aaron
Commented Mar 1, 2024 at 15:17
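A minimal sketch of that suggestion, assuming the per-file work can be written as a top-level function. The `process_row` body here is a hypothetical placeholder for the embedding and upload step from the question, and `process_all` is a name introduced only for this example. Whether this alone removes the error is not confirmed, since the thread pool in the stack trace is created somewhere outside the posted code.

```python
import csv
from multiprocessing import Pool


def process_row(row):
    # Hypothetical placeholder: the real version would embed row['text']
    # and upload it with the other columns as metadata.
    return 1


def process_file(file_path):
    """Process one CSV file row by row; return (path, number of rows handled)."""
    count = 0
    with open(file_path, "r", newline="") as file:
        for row in csv.DictReader(file):
            count += process_row(row)
    return file_path, count


def process_all(file_paths, workers=10):
    """Process every file, with at most `workers` files in flight at a time."""
    # The pool keeps a fixed number of worker processes and hands each one
    # the next file as soon as it finishes the previous one.
    with Pool(processes=workers) as pool:
        return list(pool.imap_unordered(process_file, file_paths))
```

It could be called as `process_all(['file1', 'file2', 'file3'])` from under an `if __name__ == '__main__':` guard, as in the question. With hundreds of paths, the number of processes stays at ten instead of growing with the number of files.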
