Pandas read csv simultaneously passing usecols and names args

Question

When reading a CSV file into a pandas DataFrame, an error is raised if I try to select a subset of columns by their original names (usecols=) and rename the selected columns (names=) in the same call. Passing the renamed column names to usecols works, but then names must list all columns for the right ones to be selected.

import pandas as pd

# folder_csv is assumed to hold the path of the directory containing the file
# read the entire CSV
df1a = pd.read_csv(folder_csv+'test_read_csv.csv')
# select a subset of columns while reading the CSV
df1b = pd.read_csv(folder_csv+'test_read_csv.csv', usecols=['Col1','Col3'])
# rename columns while reading the CSV
df1c = pd.read_csv(folder_csv+'test_read_csv.csv', names=['first', 'second', 'third'], header=0)

# select a subset of columns and rename them while reading the CSV;
# throws error "ValueError: Usecols do not match columns, columns expected but not found: ['Col3', 'Col1']"
df1d = pd.read_csv(folder_csv+'test_read_csv.csv', usecols=['Col1','Col3'], names=['first','third'])

# selects columns 1 and 2, calling them 1 and 3
df1e = pd.read_csv(folder_csv+'test_read_csv.csv', usecols=['first','third'], names=['first','third'])
# selects columns 1 and 3 correctly
df1f = pd.read_csv(folder_csv+'test_read_csv.csv', usecols=['first','third'], names=['first','second','third'])

The CSV file test_read_csv.csv is:

Col1,Col2,Col3
val1a,val2a,val3a
val1b,val2b,val3b
val1c,val2c,val3c
val1d,val2d,val3d
val1e,val2e,val3e

Wouldn't it be a fairly common use case to select certain columns based on the original column names and then renaming only those columns while reading the data?

Of course, it is possible to select the columns and rename them after loading the entire CSV file:

df1 = df1a[['Col1','Col3']]
df1.columns = ['first', 'third']

But I don't know whether, and how, this can be done directly when reading the data. The same also holds for pd.read_excel().

This question is similar to: How can I change some, but not all, column names when using pd.read_excel?. If you believe it’s different, please edit the question, make it clear how it’s different and/or how the answers on that question are not helpful for your problem.
– Bending Rodriguez, Commented Jun 28, 2024 at 9:30
@BendingRodriguez: One difference from that question is the use of the usecols argument. Without usecols, I understand that all columns must be named, since names is a list and not a dictionary like the one used in df.rename(columns=), so there is no way to say which columns are meant. With usecols, however, it should be clear which columns are selected, provided the renaming is done after the selection.
– silence_of_the_lambdas, Commented Jun 28, 2024 at 9:34

Timeless · Accepted Answer · 2024-06-28 10:07:36Z

1

I agree with you, but unfortunately this is how read_csv works at the moment: the names you pass are inserted as a new header, and usecols is validated (at this stage of the process) against this "new" header:

def _validate_usecols_names(self, usecols: SequenceT, names: Sequence) -> SequenceT:
    missing = [c for c in usecols if c not in names]
    if len(missing) > 0:
        raise ValueError(
            f"Usecols do not match columns, columns expected but not found: "
            f"{missing}"
        )

    return usecols

Basically, pandas selects the usecols from the names rather than setting the latter as new labels for the former. A workaround (which might not satisfy you) is to use integer indices:

df1d = pd.read_csv(
    folder_csv + "test_read_csv.csv",
    header=0,                 # at first line
    names=["first", "third"], # insert this header (=override)
    usecols=[0, 2],           # and select 1st/3rd columns
)

print(df1d)

#    first  third
# 0  val1a  val3a
# 1  val1b  val3b
# 2  val1c  val3c
# 3  val1d  val3d
# 4  val1e  val3e
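
If the positions are not known up front, one option is to read only the header row first and look them up from the original labels. The following is a minimal sketch of that idea, not something from the pandas docs: the inline csv_text stands in for the file, and rename_map, positions and new_names are hypothetical names introduced for illustration. The positions are sorted because pandas keeps the file's column order regardless of the order given in usecols.

```python
import io
import pandas as pd

# Inline copy of the sample file so the sketch runs on its own.
csv_text = """Col1,Col2,Col3
val1a,val2a,val3a
val1b,val2b,val3b
"""

# Hypothetical mapping: original label -> new label.
rename_map = {"Col1": "first", "Col3": "third"}

# Read only the header row (no data rows) to learn the column order.
header = list(pd.read_csv(io.StringIO(csv_text), nrows=0).columns)

# Sort the wanted positions so the new names line up with file order.
positions = sorted(header.index(col) for col in rename_map)
new_names = [rename_map[header[pos]] for pos in positions]

df = pd.read_csv(
    io.StringIO(csv_text),
    header=0,           # skip the original header line
    usecols=positions,  # integer positions of the wanted columns
    names=new_names,    # one new name per selected column
)
print(df)
```

This costs a second pass over the first line of the file, which is usually negligible compared with loading all columns.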

edited Jun 28, 2024 at 10:07

answered Jun 28, 2024 at 9:49

Timeless


10 Comments

silence_of_the_lambdas Over a year ago

This is certainly a possible workaround. Of course, in this case you need to know the column indices corresponding to the labels, but I don't think it is a big issue.

silence_of_the_lambdas Over a year ago

What I do not exactly understand, however, is why your example works: The way I'd understand it is that pandas first assigns new names to the columns and then selects the columns using usecols. However, doesn't it assign 'third' only to the second column so that the third one should still keep its original label 'Col3'?

Timeless Over a year ago

When you pass a list of names that has fewer elements than the original header, as with pd.read_csv(folder_csv+'test_read_csv.csv', names=['first', 'third']) (i.e., 2 vs. 3), pandas assigns the names to the rightmost columns of the CSV (here the second and third ones).

silence_of_the_lambdas Over a year ago

To be precise: You mean that the last element of names is assigned to all columns beyond (i.e. to the right of) the corresponding column index (here Col2 and Col3), correct? As opposed to pandas assigning the given labels to the rightmost columns (one could interpret your reply as such).

Timeless Over a year ago

Well, I think you did not :p. The original header is not a header anymore but the 1st row.
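
The behaviour discussed in these comments can be checked with a small experiment. This is a sketch using an inline copy of the sample data (csv_text is an assumption standing in for the file): with two names for three columns and neither header= nor usecols=, the leftmost column should end up as the index, and the original header line should show up as the first data row.

```python
import io
import pandas as pd

# Inline copy of the sample file so the sketch runs on its own.
csv_text = """Col1,Col2,Col3
val1a,val2a,val3a
val1b,val2b,val3b
"""

# Two names for three columns, without header= and without usecols=.
df = pd.read_csv(io.StringIO(csv_text), names=["first", "third"])

# The leftmost column becomes the index, original header cell included.
print(df.index.tolist())

# The first data row holds the original header cells of the named columns.
print(df.iloc[0].tolist())
```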


paime · Accepted Answer · 2024-06-28 12:02:00Z

1

As @Timeless stated, you can't do it directly from read_csv(), but you can use rename as follows:

>>> cols = {"Col1": "first", "Col3": "third"}
>>> pd.read_csv("test_read_csv.csv", usecols=cols.keys()).rename(columns=cols)
   first  third
0  val1a  val3a
1  val1b  val3b
2  val1c  val3c
3  val1d  val3d
4  val1e  val3e
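
If this pattern is needed in several places, it could be wrapped in a small function. The sketch below is only an illustration: read_csv_renamed and rename_map are hypothetical names, not part of pandas, and the inline csv_text stands in for the file. Since usecols keeps the file's column order, the helper also reorders the columns to follow the mapping before renaming.

```python
import io
import pandas as pd


def read_csv_renamed(source, rename_map, **kwargs):
    """Hypothetical helper: load only the keys of rename_map, then rename them."""
    wanted = list(rename_map)
    df = pd.read_csv(source, usecols=wanted, **kwargs)
    # usecols keeps file order, so reorder to the mapping's order first.
    return df[wanted].rename(columns=rename_map)


# Inline copy of the sample file so the sketch runs on its own.
csv_text = """Col1,Col2,Col3
val1a,val2a,val3a
val1b,val2b,val3b
"""

df = read_csv_renamed(io.StringIO(csv_text), {"Col3": "third", "Col1": "first"})
print(df)
```

Since pd.read_excel() also accepts a list of column labels for usecols, the same approach should carry over, although that is not tested here.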

answered Jun 28, 2024 at 12:02

paime


Hermann12 · Accepted Answer · 2024-06-28 12:15:53Z

1

According to the docs, I think this should give you what you want:

import pandas as pd

df = pd.read_csv("test.csv", usecols=[0,2], header=0, names=['first','third'])
print(df)

Output:

   first  third
0  val1a  val3a
1  val1b  val3b
2  val1c  val3c
3  val1d  val3d
4  val1e  val3e

edited Jun 28, 2024 at 12:15

answered Jun 28, 2024 at 9:49

Hermann12
