0

I want to insert an array that contains empty values created from Pandas in Python, and these empty values default to np.nan in Pandas dataframe. I don't want them to be 'NaN' in my PostgreSQL database, I want my PostgreSQL arrays to contain empty values like this: '{123,24,,23}' so they are not counted in my aggregate functions like calculating the mean or standard deviation across indices. I am not sure if it is possible to have sparse arrays in PostgreSQL. There won't be a lot of sparse arrays in my dataset, I am just testing this for edge case purposes.

My table schema:

create_table = '''
            CREATE TABLE {t} (
                patient_id VARCHAR[20] PRIMARY KEY,
                gene_expression double precision []
            );
        '''

The relevant Python code (I don't know how to write the proper SQL code here). Here I converted the array into a string, because Python arrays cannot be sparse:

df = df.fillna('')
NCI = [1]
MCI = [2,3]
AD = [4,5]
other = [6]

insert_sql = '''
                INSERT INTO {t} (patient_id, gene_expression)
                VALUES (%s,%s);
            '''
cur = psql_conn.cursor()

for index, row in df.iterrows():
    arr = row[2:].tolist()
    postgres_arr = ','.join(map(str, arr))
    if row['DIAGNOSIS'].isdigit():
        if int(row['DIAGNOSIS']) in NCI:
            cur.execute(insert_sql.format(t='nci'), (row['PATIENT_ID'], postgres_arr,))

        elif int(row['DIAGNOSIS']) in MCI:
            cur.execute(insert_sql.format(t='mci'), (row['PATIENT_ID'], postgres_arr,))

        elif int(row['DIAGNOSIS']) in AD:
            cur.execute(insert_sql.format(t='ad'), (row['PATIENT_ID'], postgres_arr,))

        elif int(row['DIAGNOSIS']) in other:
            cur.execute(insert_sql.format(t='other'), (row['PATIENT_ID'], postgres_arr,))

    elif row['DIAGNOSIS'] == '':
        cur.execute(insert_sql.format(t='na'), (row['PATIENT_ID'], postgres_arr,))

    else:
        print('ERROR: unknown diagnosis {d}.'.format(d=diagnosis))

psql_conn.commit()
cur.close()

My Error:

psycopg2.DataError: malformed array literal: "{2.0,2.4,}"
LINE 3:                     VALUES ('X100_120417','{2.0,2.4,}');
                                                  ^
DETAIL:  Unexpected "}" character.

2 Answers 2

2

After a few hours of trial and error:

Load this Pandas dataframe df from some CSV file:

+----+-------+--------------+
| id | stuff |    array     |
+----+-------+--------------+
|  0 | a     | {1,2,3}      |
|  1 | b     | {1,np.nan,3} |
|  2 | 45    | {np.nan,4,2} |
+----+-------+--------------+

process in pandas using:

df = df.fillna('NULL')
insert_sql = '''
                INSERT INTO {t} (patient_id, gene_expression)
                VALUES (%s,%s);
            '''

for index, row in df.iterrows():
    arr = row[2:].tolist()
    postgres_arr = '{' + ','.join(map(str,arr)) + '}'
    cur.execute(insert_sql.format(t='my_table'), (row['id'], postgres_arr,))

My main issue was recognizing that string literal 'NULL' automatically translate to PostgreSQL NULL keyword, which is ignored in calculations and results of aggregate functions return a value as if the NULL values are not there, versus NaN keyword where every operation with it results in NaN.

Sign up to request clarification or add additional context in comments.

Comments

1

If you want to create a column with max length, use bracket, not square brackets. change VARCHAR[20] to VARCHAR(20) in create table statement. Otherwise first %s is expected to be array and it is varchar. Here is sample - mind that patient_id is created as array, not varchar...

t=# CREATE TABLE so23 (
                patient_id VARCHAR[20] PRIMARY KEY,
                gene_expression double precision []
            );
CREATE TABLE
t=# \d+ so23
                                    Table "public.so23"
     Column      |        Type         | Modifiers | Storage  | Stats target | Description
-----------------+---------------------+-----------+----------+--------------+-------------
 patient_id      | character varying[] | not null  | extended |              |
 gene_expression | double precision[]  |           | extended |              |
Indexes:
    "so23_pkey" PRIMARY KEY, btree (patient_id)

1 Comment

thanks that was a typo, how would you insert NULL instead of 'nan' in the array? Because all operations with 'nan' results in 'nan', but I want to ignore 'nan' values.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.