How to merge two CSV files based on matching values in different columns and keep unmatched rows with placeholders?

Question

I'm working on a data cleaning task and could use some help. I have two CSV files with thousands of rows each:

File A contains product shipment records. File B contains product descriptions and categories. Here’s a simplified example:

File A (shipments.csv):

shipment_id,product_code,quantity,date
S001,P123,10,2025-07-01
S002,P456,5,2025-07-02
S003,P789,8,2025-07-03

File B (products.csv):

product_code,description,category
P123,Widget A,Tools
P456,Widget B,Hardware

I want to create a merged file where each row from File A is enriched with the matching product description and category from File B (based on product_code). If there's no match, I’d like to keep the row from File A and fill the missing columns with "N/A".

Expected Output:

shipment_id,product_code,quantity,date,description,category
S001,P123,10,2025-07-01,Widget A,Tools
S002,P456,5,2025-07-02,Widget B,Hardware
S003,P789,8,2025-07-03,N/A,N/A

I tried using pandas.merge() in Python but it drops unmatched rows unless I use how='left', and I’m not sure how to fill missing values properly.

Any help? Thanks in advance!

pmf · Accepted Answer · 2025-07-31 04:42:11Z

7

As tagged, using awk (note the order of the files):

awk -F , -v OFS=, '
  NR == FNR {a[$1] = $2 OFS $3; next}
  {print $0, ($2 in a? a[$2]: "N/A" OFS "N/A")}
' products.csv shipments.csv

shipment_id,product_code,quantity,date,description,category
S001,P123,10,2025-07-01,Widget A,Tools
S002,P456,5,2025-07-02,Widget B,Hardware
S003,P789,8,2025-07-03,N/A,N/A

edited Jul 31 at 4:42

answered Jul 31 at 4:06


3 Comments

user21677098 Jul 31 at 4:12

Thanks for the helpful awk solution! It works well for quick tasks. One small issue: the output includes the product_code column twice. Ideally, I’d like to append only the description and category fields from products.csv, not the entire line. Also, since I'm working with large datasets and doing more data cleaning, I might stick with pandas for better scalability. Appreciate your input!

pmf Jul 31 at 4:42

You were right, fixed it.

the busybee Jul 31 at 5:29

It even merges the column header row.

Ranger · Accepted Answer · 2025-08-08 06:42:15Z

4

Your approach is correct: first merge the two DataFrames on product_code, then fill the missing values with "N/A". When a row in the left DataFrame has no match in the right DataFrame, pandas automatically puts NaN in the right-hand columns of that row.

import pandas as pd
shipments= [
    ["S001", "P123", 10, "2025-07-01"],
    ["S002", "P456", 5, "2025-07-02"],
    ["S003", "P789", 8, "2025-07-03"]
]

columns = ["shipment_id", "product_code", "quantity", "date"]
df_shipments = pd.DataFrame(shipments, columns=columns)

products = [
    ["P123", "Widget A", "Tools"],
    ["P456", "Widget B", "Hardware"]
]

columns = ["product_code", "description", "category"]

df_products = pd.DataFrame(products, columns=columns)

merged_df = pd.merge(df_shipments, df_products,how = 'left').fillna("N/A")

The last line of the code performs the left merge you described, and fillna then replaces the missing values.
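One caveat worth noting: calling fillna("N/A") on the whole merged frame also replaces any missing values that were already present in the shipment columns. Below is a minimal sketch, assuming the same sample data as above, that fills only the product columns and uses merge's indicator option to list the codes that had no match. The names product_cols and unmatched_codes are illustrative only.

```python
import pandas as pd

df_shipments = pd.DataFrame(
    [["S001", "P123", 10, "2025-07-01"],
     ["S002", "P456", 5, "2025-07-02"],
     ["S003", "P789", 8, "2025-07-03"]],
    columns=["shipment_id", "product_code", "quantity", "date"],
)
df_products = pd.DataFrame(
    [["P123", "Widget A", "Tools"], ["P456", "Widget B", "Hardware"]],
    columns=["product_code", "description", "category"],
)

# indicator=True adds a "_merge" column: "both" for matched rows,
# "left_only" for shipments without a product entry.
merged_df = pd.merge(df_shipments, df_products, how="left",
                     on="product_code", indicator=True)
unmatched_codes = merged_df.loc[
    merged_df["_merge"] == "left_only", "product_code"
].tolist()

# Fill only the columns that came from the products table.
product_cols = ["description", "category"]
merged_df[product_cols] = merged_df[product_cols].fillna("N/A")
merged_df = merged_df.drop(columns="_merge")
```

With the sample data, unmatched_codes contains only P789, and the quantity column keeps its numeric dtype because it is never passed through fillna.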

Additional: Since product_code is the only column name the two DataFrames share, merge matches on it automatically. If more than one column has the same name in both DataFrames, merge tries to match on all of them. To restrict the join to a chosen column, pass on, or left_on and right_on when the column names differ.
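As a rough illustration of left_on and right_on, suppose the key column in the products table were named code instead of product_code. That variant is hypothetical and not part of the question's data. Both key columns are kept after such a merge, so the sketch drops the right-hand one before filling:

```python
import pandas as pd

df_shipments = pd.DataFrame(
    [["S001", "P123", 10, "2025-07-01"],
     ["S002", "P456", 5, "2025-07-02"],
     ["S003", "P789", 8, "2025-07-03"]],
    columns=["shipment_id", "product_code", "quantity", "date"],
)
# Hypothetical products table whose key column is called "code".
df_products_alt = pd.DataFrame(
    [["P123", "Widget A", "Tools"], ["P456", "Widget B", "Hardware"]],
    columns=["code", "description", "category"],
)

merged_alt = (
    df_shipments
    .merge(df_products_alt, how="left", left_on="product_code", right_on="code")
    .drop(columns="code")  # remove the duplicate key column from the right side
    .fillna("N/A")
)
```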

Moreover, if you want only category from the right dataframe, you can use something like:

merged_df = pd.merge(df_shipments, df_products[['product_code', 'category']], how = 'left').fillna("N/A")

This could be better, as the column selection happens at merge time and df_products itself is left unchanged.

edited Aug 8 at 6:42

answered Jul 31 at 4:14


2 Comments

user21677098 Jul 31 at 4:20

Thanks! This approach using pandas is exactly what I was aiming for. Just a small correction: the variable df in the merge line should be df_shipments, otherwise it throws an error.

Ranger Jul 31 at 4:24

Fixed! If so do not forget to accept it as answer.

Daweo · Accepted Answer · 2025-07-31 10:58:07Z

I propose the following Python solution using csv, which is part of the standard library. Let the content of shipments.csv be

shipment_id,product_code,quantity,date
S001,P123,10,2025-07-01
S002,P456,5,2025-07-02
S003,P789,8,2025-07-03

and the content of products.csv be

product_code,description,category
P123,Widget A,Tools
P456,Widget B,Hardware

then create a file named merger.py with the following content

import csv

OUTFIELDS = ['shipment_id', 'product_code', 'quantity', 'date', 'description', 'category']

if __name__ == '__main__':
    products = {}
    with open('products.csv', newline='') as prodfile:
        for row in csv.DictReader(prodfile):
            product_code = row['product_code']
            products[product_code] = row
    with open('shipments.csv', newline='') as shipfile:
        with open('merged.csv', 'w', newline='') as outfile:
            reader = csv.DictReader(shipfile)
            writer = csv.DictWriter(outfile, OUTFIELDS, restval='N/A', quoting=csv.QUOTE_MINIMAL)
            writer.writeheader()
            for row in reader:
                product_code = row['product_code']
                row.update(products.get(product_code, {}))
                writer.writerow(row)

and run python3 merger.py to create merged.csv with the following content

shipment_id,product_code,quantity,date,description,category
S001,P123,10,2025-07-01,Widget A,Tools
S002,P456,5,2025-07-02,Widget B,Hardware
S003,P789,8,2025-07-03,N/A,N/A

Explanation: I first read products.csv into the products dict, whose keys are product codes and whose values are the rows stored as dictionaries. Then I process shipments.csv row by row, updating each row with the corresponding data from products if available, or with an empty dict otherwise (which changes nothing). Each row is then written to merged.csv. Note that the writer is instructed, via restval, to fill missing values with N/A.

As you can see, this solution is more verbose than the pandas merge solution, but keep in mind that it does not load the whole of shipments.csv into memory and will therefore also work if shipments.csv is bigger than available memory.
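If pandas is still preferred for a shipments file that does not fit in memory, one possible approach is to read it in chunks with read_csv's chunksize parameter and append each merged chunk to the output. The following is only a sketch under the assumption that products.csv is small enough to hold in memory. The function name merge_in_chunks and the temporary demo files are illustrative.

```python
import os
import tempfile

import pandas as pd


def merge_in_chunks(shipments_path, products_path, out_path, chunksize=100_000):
    """Left-join shipments to products one chunk at a time."""
    # Assumption: the products table fits in memory.
    products = pd.read_csv(products_path, dtype=str)
    first = True
    # With chunksize set, read_csv yields DataFrames of at most that many rows.
    for chunk in pd.read_csv(shipments_path, dtype=str, chunksize=chunksize):
        merged = chunk.merge(products, how="left", on="product_code").fillna("N/A")
        # Write the header with the first chunk only, then append.
        merged.to_csv(out_path, mode="w" if first else "a",
                      header=first, index=False)
        first = False


# Small demonstration with the sample data in a temporary directory.
workdir = tempfile.mkdtemp()
shipments_path = os.path.join(workdir, "shipments.csv")
products_path = os.path.join(workdir, "products.csv")
out_path = os.path.join(workdir, "merged.csv")

with open(shipments_path, "w", newline="") as f:
    f.write("shipment_id,product_code,quantity,date\n"
            "S001,P123,10,2025-07-01\n"
            "S002,P456,5,2025-07-02\n"
            "S003,P789,8,2025-07-03\n")
with open(products_path, "w", newline="") as f:
    f.write("product_code,description,category\n"
            "P123,Widget A,Tools\n"
            "P456,Widget B,Hardware\n")

# chunksize=2 forces more than one chunk for the three sample rows.
merge_in_chunks(shipments_path, products_path, out_path, chunksize=2)
```

Reading everything as strings (dtype=str) keeps values such as the quantity exactly as they appear in the input. Note that the blanket fillna here also turns empty shipment fields into N/A.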

(tested in Python 3.12.3)

potong · Accepted Answer · 2025-08-03 10:05:47Z

3

This might work for you (GNU join):

join -t, -1 2 -2 1 -a 1 -e 'N/A' -o '1.1,0,1.3,1.4,2.2,2.3' fileA fileB

N.B. Both files must be sorted by product_code.

If both files include the headers then use (GNU sed & sort):

( sed -u 1q ; sort -t, -k2,2 ) < fileA > fileA.sorted
( sed -u 1q ; sort -t, -k1,1 ) < fileB > fileB.sorted
join --header -t, -1 2 -2 1 -a 1 -e 'N/A' -o 1.1,1.2,1.3,1.4,2.2,2.3 fileA.sorted fileB.sorted

edited Aug 3 at 10:05

answered Jul 31 at 6:38


nasrin begum pathan · Accepted Answer · 2025-07-31 04:35:42Z

2

This is one way.

import pandas as pd

shipments = pd.read_csv('shipments.csv')  
products = pd.read_csv('products.csv')    

merged = pd.merge(shipments, products, how='left', on='product_code').fillna("N/A")

print(merged)
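To write the result out rather than print it, to_csv with index=False should produce the layout shown in the question. One thing to be aware of: by default read_csv treats the literal text N/A as a missing value, so reading the merged data back turns the placeholders into NaN again unless keep_default_na=False is passed. A small sketch, using in-memory buffers in place of the real files so it is self-contained:

```python
import io

import pandas as pd

shipments = pd.read_csv(io.StringIO(
    "shipment_id,product_code,quantity,date\n"
    "S001,P123,10,2025-07-01\n"
    "S002,P456,5,2025-07-02\n"
    "S003,P789,8,2025-07-03\n"
))
products = pd.read_csv(io.StringIO(
    "product_code,description,category\n"
    "P123,Widget A,Tools\n"
    "P456,Widget B,Hardware\n"
))

merged = pd.merge(shipments, products, how="left", on="product_code").fillna("N/A")

# index=False keeps the row index out of the output.
buffer = io.StringIO()
merged.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Default parsing converts "N/A" back to NaN; keep_default_na=False keeps the text.
reread_default = pd.read_csv(io.StringIO(csv_text))
reread_literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
```

With real files, merged.to_csv('merged.csv', index=False) would be the equivalent call; the file name merged.csv is an assumption here.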

answered Jul 31 at 4:35
