I have multiple CSV files, named 2C-BEB-29-2009-01-18.csv, 2C-BEB-29-2010-02-18.csv, 2C-BEB-29-2010-03-28.csv, 2C-ISI-12-2010-01-01.csv, and so on.

  • The 2C- prefix is the same in all CSV files.

  • BEB is the name of the recording device.

  • 29 stands for the user ID

  • 2009-01-18 stands for the date of the recording.
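
The naming scheme above can be sketched as a regular expression. This is only an illustration: the helper name `parse_recording_name` is hypothetical, and the pattern assumes the device name contains no dash and the user ID is purely numeric, which holds for the example names but is not confirmed for all files.

```python
import re

# Assumed layout, taken from the example names: 2C-<device>-<user id>-<YYYY-MM-DD>.csv
FILENAME_RE = re.compile(
    r"^2C-(?P<device>[^-]+)-(?P<user_id>\d+)-(?P<date>\d{4}-\d{2}-\d{2})\.csv$"
)


def parse_recording_name(name):
    """Return (device, user_id, date) as strings, or None if the name does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    return match.group("device"), match.group("user_id"), match.group("date")
```

For example, `parse_recording_name("2C-BEB-29-2009-01-18.csv")` yields the device, the user ID and the date as three separate strings, so the date does not have to be cut out relative to a hard-coded ID.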

I have around 150 different IDs, recorded with different devices. I would like to automate, for all user IDs, the following approach, which I have so far applied to a single user ID.

I use the following code for a single user, namely with pattern='2C-BEB-29-*.csv' passed as a string. Note that I am in the correct directory.

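import glob
import pandas as pd

def pd_read_pattern(pattern):
   files = glob.glob(pattern)

   frames = []
   for f in files:
       # columns may be separated by whitespace, ';' or ','
       a = pd.read_csv(f, sep=r'\s+|;|,', engine='python')
       # the date is whatever follows the user ID in the file name;
       # the '29-' marker must be changed depending on the patient ID
       a['date'] = f.rsplit('29-', 1)[-1].rsplit('.', 1)[0]
       frames.append(a)

   # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
   df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
   #df = df[df['hf']!=0]
   return df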

To apply the above code to all user IDs, I read the CSV file names as follows and save one pattern per user ID into a list. To avoid duplicate IDs, I convert the list into a set at the end of this snippet.

import glob
lst = []
for name in glob.glob('*.csv'):
    if len(name) > 15:
        # keep the first three dash-separated fields, e.g. '2C-BEB-29'
        prefix = '-'.join(name.split('-', 3)[:3])
        lst.append(prefix + '-*.csv')
lst = set(lst)  # drop duplicate patterns
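
As a possible alternative to collecting wildcard patterns and globbing a second time, the file names could be grouped by their prefix in one pass. This is a minimal sketch, not the approach used in this question; `group_by_prefix` is a hypothetical helper, and it assumes every relevant name has at least four dash-separated fields.

```python
from collections import defaultdict


def group_by_prefix(names):
    """Map each '2C-<device>-<user id>' prefix to the sorted file names sharing it."""
    groups = defaultdict(list)
    for name in names:
        parts = name.split("-", 3)
        if len(parts) < 4:
            # Not a recording file in the assumed naming scheme; skip it.
            continue
        groups["-".join(parts[:3])].append(name)
    return {prefix: sorted(files) for prefix, files in groups.items()}
```

Calling `group_by_prefix(glob.glob('*.csv'))` would then give, for each prefix, the concrete list of files to read.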

Now I have the unique ID patterns in this example format: '2C-BEB-29-*.csv'. With the code snippet below, I am trying to read the files for each user ID. However, I get a Unicode decode error on the pd.read_csv line. Could you help me with this issue?

for file in lst:
    #print(type(file))
    files = glob.glob(file)
    #print(files)
    df = pd.DataFrame()
    for f in files:
        csv_file = open(f)
        #print(f, type(f))
        a = pd.read_csv(f,sep='\s+|;|,', engine='python')

        #date column should be changed depending on patient id
        #a['date'] = str(csv_file.name).rsplit(f.split('-',3)[2]+'-',1)[-1].rsplit('.',1)[0]

        #df = df.append(a)
        #df = df[df['hf']!=0]


    #return df.reset_index(drop=True)
  • Given a list of files in the current path (lst), are you sure you need to glob again? Did you instead intend to run pandas.read_csv() on file rather than f? Commented Sep 8, 2022 at 9:54
  • If I do not use glob again, csv_file = open(f) cannot find the files, because I locate them with a wildcard pattern such as '2C-BEB-29-*.csv'. Commented Sep 8, 2022 at 10:10

1 Answer

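First, import chardet, a third-party package that guesses the encoding of raw bytes:

import chardet

Then replace this line of your code

a =  pd.read_csv(f,sep='\s+|;|,', engine='python')

with the following, which detects each file's encoding and passes it to pd.read_csv:

with open(f, 'rb') as file:
   # chardet.detect returns a dict; its "encoding" entry is a best guess
   encoding = chardet.detect(file.read())["encoding"]
a = pd.read_csv(f, sep=r'\s+|;|,', engine='python', encoding=encoding)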
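
If installing chardet is not an option, a rough alternative (a different technique from the detection above) is to try a short list of candidate encodings in order and keep the first one that decodes without error. The helper `decode_with_fallback` and the candidate list are assumptions for illustration; the right candidates depend on where the files came from.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes raw without error."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")
```

The encoding name returned here could then be passed to pd.read_csv through its encoding argument. Note that latin-1 accepts any byte sequence, so placing it last makes it a catch-all that may decode without error yet still produce wrong characters.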