
Is there a simple way to find and remove duplicate rows from a CSV file?

Sample test.csv file:

row1 test tyy......
row2 tesg ghh
row2 tesg ghh
row2 tesg ghh
....
row3 tesg ghh
row3 tesg ghh
...
row4 tesg ghh

Expected results:

row1 test tyy......
row2 tesg ghh
....
row3 tesg ghh
...
row4 tesg ghh

Where can I start to accomplish this with PHP?

  • What have you done so far?
  • Is it true that all duplicate lines really do appear consecutively?

2 Answers


A straight-to-the-point method would be to read the file line by line and keep track of each row you've previously seen. If the current row has already been seen, skip it.

Something like the following (untested) code may work:

<?php
// array to hold all "seen" lines
$lines = array();

// open the csv file
if (($handle = fopen("test.csv", "r")) !== false) {
    // read each line into an array
    while (($data = fgetcsv($handle, 8192, ",")) !== false) {
        // build a "line" from the parsed data
        $line = join(",", $data);
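        // (note: join() drops any CSV quoting, so fields containing commas
        // could collide; that's fine for simple data like the sample above)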

        // if the line has been seen, skip it
        if (isset($lines[$line])) continue;

        // save the line
        $lines[$line] = true;
    }
    fclose($handle);
}

// build the new content-data
$contents = '';
foreach ($lines as $line => $bool) $contents .= $line . "\r\n";

// save it to a new file
file_put_contents("test_unique.csv", $contents);
?>

This code uses fgetcsv() with a comma as the column delimiter (based on the sample data in your question comment).

Storing every line that has been seen, as above, ensures that all duplicate lines in the file are removed, regardless of whether or not they directly follow one another. If duplicates are always going to be back-to-back, a simpler (and more memory-conscious) method would be to store only the last-seen line and compare it against the current one.
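For example, here is a minimal (untested) sketch of that last-seen-line approach, assuming the same test.csv and test_unique.csv file names as above:

<?php
// the previously-seen line; the only state kept in memory
$previous = null;

// open the csv file for reading and the "save-file" for writing
if (($readHandle = fopen("test.csv", "r")) !== false) {
    if (($saveHandle = fopen("test_unique.csv", "w")) !== false) {
        while (($data = fgetcsv($readHandle, 8192, ",")) !== false) {
            // rebuild the line so it can be compared as a single string
            $line = join(",", $data);

            // write the line only if it differs from the one directly above it
            if ($line !== $previous) {
                fputcsv($saveHandle, $data);
            }
            $previous = $line;
        }
        fclose($saveHandle);
    }
    fclose($readHandle);
}
?>

Note that this only collapses back-to-back repeats; if an identical row reappears later in the file, it is kept.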

UPDATE (duplicate lines via the SKU column, not the full line)
Based on sample data provided in a comment, the "duplicate lines" aren't actually equal (though they are similar, they differ in a good number of columns). The similarity between them comes down to a single column, the sku.

The following is an expanded version of the above code. This block parses the first line (the column list) of the CSV file to determine which column contains the sku code. From there, it keeps a unique list of SKU codes seen and, if the current line has a "new" code, writes that line to the new "unique" file using fputcsv():

<?php
// array to hold all unique SKU codes
$skus = array();

// index of the `sku` column
$skuIndex = -1;

// open the "save-file"
if (($saveHandle = fopen("test_unique.csv", "w")) !== false) {
    // open the csv file
    if (($readHandle = fopen("test.csv", "r")) !== false) {
        // read each line into an array
        while (($data = fgetcsv($readHandle, 8192, ",")) !== false) {
            if ($skuIndex == -1) {
                // we need to determine what column the "sku" is; this will identify
                // the "unique" rows
                foreach ($data as $index => $column) {
                    if ($column == 'sku') {
                        $skuIndex = $index;
                        break;
                    }
                }
                if ($skuIndex == -1) {
                    echo "Couldn't determine the SKU-column.";
                    die();
                }
                // write the header line to the file, then move on to the data
                // (the continue prevents the header from being written twice)
                fputcsv($saveHandle, $data);
                continue;
            }

            // if the sku has been seen, skip it
            if (isset($skus[$data[$skuIndex]])) continue;
            $skus[$data[$skuIndex]] = true;

            // write this line to the file
            fputcsv($saveHandle, $data);
        }
        fclose($readHandle);
    }
    fclose($saveHandle);
}
?>

Overall, this method is far more memory-friendly, as it doesn't need to keep a copy of every line in memory (only the SKU codes).


Comments

Duplicates should only be removed if the row is repeated. If the same row comes back later in the CSV it should be included.
@ArnoldDaniels I see no such statement in the post regarding that. Please let me know where you received that information from and I can update my answer accordingly.
Take a look at the 'Expected results'. You can see the non-unique rows.
Sorry, but after running the code and opening test_unique.csv, the data is mixed up and out of order.
@user1932607 Unordered? It will put the contents back into the file in the same order it read it in. Is the answer supposed to, in addition to removing duplicates, order the data as well?

One line solution:

file_put_contents('newdata.csv', array_unique(file('data.csv')));
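One caveat (assuming the data.csv / newdata.csv names above): file() keeps each line's trailing newline, so a final line without one won't match an otherwise identical earlier line. A slightly more defensive (untested) variant strips the newlines first:

file_put_contents('newdata.csv', implode(PHP_EOL, array_unique(file('data.csv', FILE_IGNORE_NEW_LINES))) . PHP_EOL);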

1 Comment

Did you test this?
