4

I'm working on a data cleaning task and could use some help. I have two CSV files with thousands of rows each:

File A contains product shipment records. File B contains product descriptions and categories. Here’s a simplified example:

File A (shipments.csv):

shipment_id,product_code,quantity,date
S001,P123,10,2025-07-01
S002,P456,5,2025-07-02
S003,P789,8,2025-07-03 

File B (products.csv):

product_code,description,category
P123,Widget A,Tools
P456,Widget B,Hardware

I want to create a merged file where each row from File A is enriched with the matching product description and category from File B (based on product_code). If there's no match, I’d like to keep the row from File A and fill the missing columns with "N/A".

Expected Output:

shipment_id,product_code,quantity,date,description,category
S001,P123,10,2025-07-01,Widget A,Tools
S002,P456,5,2025-07-02,Widget B,Hardware
S003,P789,8,2025-07-03,N/A,N/A

I tried using pandas.merge() in Python but it drops unmatched rows unless I use how='left', and I’m not sure how to fill missing values properly.

Any help? Thanks in advance!

5 Answers

7

As tagged, using awk (note the order of the files):

awk -F , -v OFS=, '
  NR == FNR {a[$1] = $2 OFS $3; next}
  {print $0, ($2 in a? a[$2]: "N/A" OFS "N/A")}
' products.csv shipments.csv 
shipment_id,product_code,quantity,date,description,category
S001,P123,10,2025-07-01,Widget A,Tools
S002,P456,5,2025-07-02,Widget B,Hardware
S003,P789,8,2025-07-03,N/A,N/A

3 Comments

Thanks for the helpful awk solution! It works well for quick tasks. One small issue: the output includes the product_code column twice. Ideally, I’d like to append only the description and category fields from products.csv, not the entire line. Also, since I'm working with large datasets and doing more data cleaning, I might stick with pandas for better scalability. Appreciate your input!
You were right, fixed it.
It even merges the column header row.
4

Your approach is correct: first merge the two DataFrames on product_code, then fill the missing values with "N/A". When a row in the left DataFrame has no match in the right DataFrame, the merge automatically puts NaN in the right-hand columns for that row.

import pandas as pd
shipments= [
    ["S001", "P123", 10, "2025-07-01"],
    ["S002", "P456", 5, "2025-07-02"],
    ["S003", "P789", 8, "2025-07-03"]
]

columns = ["shipment_id", "product_code", "quantity", "date"]
df_shipments = pd.DataFrame(shipments, columns=columns)

products = [
    ["P123", "Widget A", "Tools"],
    ["P456", "Widget B", "Hardware"]
]

columns = ["product_code", "description", "category"]

df_products = pd.DataFrame(products, columns=columns)

merged_df = pd.merge(df_shipments, df_products, how='left').fillna("N/A")

The last line of the code performs the left merge you described and then uses fillna to fill the missing values.

Additional: since product_code is the only column name shared by both DataFrames, the merge automatically joins on it. If more than one column has the same name in both DataFrames, the merge will try to match on all of them. To restrict the join to a chosen column, pass on='product_code' (or left_on and right_on if the key columns have different names).

Moreover, if you want only category from the right dataframe, you can use something like:

merged_df = pd.merge(df_shipments, df_products[['product_code', 'category']], how='left').fillna("N/A")

This could be better, as it does not alter the original df_products DataFrame.
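A minimal sketch of the explicit-key variant, assuming the same sample data as above. Passing on="product_code" pins the join key, and filling only the two columns that come from the products DataFrame avoids overwriting genuine gaps in the shipment columns. The validate="many_to_one" argument is an optional extra that is not part of the answer above; it raises an error if a product code appears more than once in the products data.

```python
import pandas as pd

df_shipments = pd.DataFrame(
    [
        ["S001", "P123", 10, "2025-07-01"],
        ["S002", "P456", 5, "2025-07-02"],
        ["S003", "P789", 8, "2025-07-03"],
    ],
    columns=["shipment_id", "product_code", "quantity", "date"],
)
df_products = pd.DataFrame(
    [["P123", "Widget A", "Tools"], ["P456", "Widget B", "Hardware"]],
    columns=["product_code", "description", "category"],
)

# Join on product_code only, keeping every shipment row.
merged_df = df_shipments.merge(
    df_products,
    how="left",
    on="product_code",
    validate="many_to_one",  # assumption: each product_code is unique in df_products
)

# Fill only the columns that came from df_products.
new_cols = ["description", "category"]
merged_df[new_cols] = merged_df[new_cols].fillna("N/A")

print(merged_df.to_csv(index=False))
```

If the shipment columns can never be empty, the blanket .fillna("N/A") from the answer gives the same result with less code.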

2 Comments

Thanks! This approach using pandas is exactly what I was aiming for. Just a small correction: the variable df in the merge line should be df_shipments, otherwise it throws an error.
Fixed! If this solved your problem, do not forget to accept it as the answer.
3

I propose the following Python solution using csv, which is part of the standard library. Let the content of shipments.csv be

shipment_id,product_code,quantity,date
S001,P123,10,2025-07-01
S002,P456,5,2025-07-02
S003,P789,8,2025-07-03

and the content of products.csv be

product_code,description,category
P123,Widget A,Tools
P456,Widget B,Hardware

then create a file named merger.py with the following content

import csv

OUTFIELDS = ['shipment_id', 'product_code', 'quantity', 'date', 'description', 'category']

if __name__ == '__main__':
    products = {}
    with open('products.csv', newline='') as prodfile:
        for row in csv.DictReader(prodfile):
            product_code = row['product_code']
            products[product_code] = row
    with open('shipments.csv', newline='') as shipfile:
        with open('merged.csv', 'w', newline='') as outfile:
            reader = csv.DictReader(shipfile)
            writer = csv.DictWriter(outfile, OUTFIELDS, restval='N/A', quoting=csv.QUOTE_MINIMAL)
            writer.writeheader()
            for row in reader:
                product_code = row['product_code']
                row.update(products.get(product_code, {}))
                writer.writerow(row)

and run it with python3 merger.py to create merged.csv with the following content

shipment_id,product_code,quantity,date,description,category
S001,P123,10,2025-07-01,Widget A,Tools
S002,P456,5,2025-07-02,Widget B,Hardware
S003,P789,8,2025-07-03,N/A,N/A

Explanation: I first process products.csv to build the products dict, whose keys are product codes and whose values are the rows stored as dictionaries. Then, for each row of shipments.csv, I update it with the corresponding data from products if available, or with an empty dict otherwise (which changes nothing). Each row is then written to merged.csv. Note that I instructed the writer to fill missing values with N/A (restval='N/A').

As you can see, this solution is more verbose than the pandas merge solution, but keep in mind that it does not load the whole of shipments.csv into memory and therefore will also work if shipments.csv is bigger than available memory. Only products.csv is held in memory, as a dict.

(tested in Python 3.12.3)
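For comparison, pandas can also stream the large file instead of loading it whole. The following is a rough sketch, not part of the answer above, that reads the shipments in chunks via the chunksize argument of read_csv. The in-memory io.StringIO objects stand in for the real files, and the chunk size of 2 is an assumption chosen only to exercise the loop on the sample data.

```python
import io
import pandas as pd

# Stand-ins for the real files; replace with 'shipments.csv' / 'products.csv'.
shipments_csv = io.StringIO(
    "shipment_id,product_code,quantity,date\n"
    "S001,P123,10,2025-07-01\n"
    "S002,P456,5,2025-07-02\n"
    "S003,P789,8,2025-07-03\n"
)
products_csv = io.StringIO(
    "product_code,description,category\n"
    "P123,Widget A,Tools\n"
    "P456,Widget B,Hardware\n"
)
out = io.StringIO()  # for a real run: open('merged.csv', 'w', newline='')

# The lookup table is assumed to fit in memory; only shipments is streamed.
products = pd.read_csv(products_csv, dtype=str)
new_cols = ["description", "category"]

# chunksize=2 is for illustration; a real run would use a much larger value.
for i, chunk in enumerate(pd.read_csv(shipments_csv, dtype=str, chunksize=2)):
    merged = chunk.merge(products, how="left", on="product_code")
    merged[new_cols] = merged[new_cols].fillna("N/A")
    # Write the header only once, with the first chunk.
    merged.to_csv(out, index=False, header=(i == 0))

print(out.getvalue())
```

Reading everything with dtype=str keeps values such as quantity exactly as they appear in the input, which mirrors what the csv module does.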

3

This might work for you (GNU join):

join -t, -1 2 -2 1 -a 1 -e 'N/A' -o '1.1,0,1.3,1.4,2.2,2.3' fileA fileB 

N.B. Both files must be sorted on product_code, and neither may contain a header line.

If both files include the headers then use (GNU sed & sort):

( sed -u 1q ; sort -t, -k2,2 ) < fileA >fileA.sorted
( sed -u 1q ; sort -t, -k1,1 ) < fileB >fileB.sorted
join --header -t, -1 2 -2 1 -a 1 -e 'N/A' -o 1.1,1.2,1.3,1.4,2.2,2.3 fileA.sorted fileB.sorted

2

This is one way.

import pandas as pd

shipments = pd.read_csv('shipments.csv')  
products = pd.read_csv('products.csv')    

merged = pd.merge(shipments, products, how='left', on='product_code').fillna("N/A")

print(merged)
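Since the goal is a merged file rather than printed output, here is a small sketch of the write step, using in-memory sample data (an assumption; a real run would pass file paths). It also shows one thing to watch for when the file is read back: by default read_csv parses the literal text "N/A" as a missing value, and keep_default_na=False preserves it as a string.

```python
import io
import pandas as pd

shipments = pd.read_csv(io.StringIO(
    "shipment_id,product_code,quantity,date\n"
    "S001,P123,10,2025-07-01\n"
    "S003,P789,8,2025-07-03\n"
))
products = pd.read_csv(io.StringIO(
    "product_code,description,category\n"
    "P123,Widget A,Tools\n"
))

merged = pd.merge(shipments, products, how="left", on="product_code").fillna("N/A")

# index=False keeps the row index out of the output.
# Pass a path such as 'merged.csv' to write a file instead of returning a string.
csv_text = merged.to_csv(index=False)

# Reading back: the default NA handling turns "N/A" into NaN again.
default_read = pd.read_csv(io.StringIO(csv_text))
literal_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read["category"].isna().sum())  # number of values parsed as missing
print(literal_read["category"].tolist())
```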
