0

When reading a CSV file into a pandas DataFrame, an error is raised if I try to select a subset of columns by their original names (usecols=) and rename the selected columns (names=) in the same call. Passing the renamed column names to usecols works, but then all columns must be passed to names for the correct columns to be selected.

import pandas as pd

# folder_csv is the path prefix of the folder that contains the CSV file
# read the entire CSV
df1a = pd.read_csv(folder_csv+'test_read_csv.csv')
# select a subset of columns while reading the CSV
df1b = pd.read_csv(folder_csv+'test_read_csv.csv', usecols=['Col1','Col3'])
# rename columns while reading the CSV
df1c = pd.read_csv(folder_csv+'test_read_csv.csv', names=['first', 'second', 'third'], header=0)

# select a subset of columns and rename them while reading the CSV;
# throws error "ValueError: Usecols do not match columns, columns expected but not found: ['Col3', 'Col1']"
df1d = pd.read_csv(folder_csv+'test_read_csv.csv', usecols=['Col1','Col3'], names=['first','third'])

# selects columns 1 and 2, calling them 1 and 3
df1e = pd.read_csv(folder_csv+'test_read_csv.csv', usecols=['first','third'], names=['first','third'])
# selects columns 1 and 3 correctly
df1f = pd.read_csv(folder_csv+'test_read_csv.csv', usecols=['first','third'], names=['first','second','third'])

The CSV file test_read_csv.csv is:

Col1,Col2,Col3
val1a,val2a,val3a
val1b,val2b,val3b
val1c,val2c,val3c
val1d,val2d,val3d
val1e,val2e,val3e

Wouldn't it be a fairly common use case to select certain columns by their original names and rename only those columns while reading the data?

Of course, it is possible to select the columns and rename them after loading the entire CSV file:

df1 = df1a[['Col1','Col3']]
df1.columns = ['first', 'third']

But I don't know whether, or how, this can be done directly while reading the data. The same applies to pd.read_excel().

2
  • This question is similar to: How can I change some, but not all, column names when using pd.read_excel?. If you believe it’s different, please edit the question, make it clear how it’s different and/or how the answers on that question are not helpful for your problem. Commented Jun 28, 2024 at 9:30
  • @BendingRodriguez: One difference to that question is the usage of the usecols arg. While without usecols I do understand that all columns must be renamed if the columns to be renamed are not provided (after all, names is a list and not a dictionary like the one used in df.rename(columns=)), having columns selected with usecols should make it clear which columns should be selected, provided the renaming is done after the selection. Commented Jun 28, 2024 at 9:34

3 Answers

1

I agree with you, but unfortunately this is how read_csv works at the moment: the names you pass are inserted as a new header, and usecols is validated (at this stage of the process) against this "new" header:

def _validate_usecols_names(self, usecols: SequenceT, names: Sequence) -> SequenceT:
    missing = [c for c in usecols if c not in names]
    if len(missing) > 0:
        raise ValueError(
            f"Usecols do not match columns, columns expected but not found: "
            f"{missing}"
        )

    return usecols
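
The effect of this check can be reproduced with a small, self-contained sketch. It uses an in-memory copy of the first rows of the CSV from the question (via io.StringIO, which is an assumption made only so the example runs without a file on disk) and shows that label-based usecols are matched against names, not against the header in the file:

```python
import io

import pandas as pd

# In-memory stand-in for test_read_csv.csv (first two data rows only).
csv_text = "Col1,Col2,Col3\nval1a,val2a,val3a\nval1b,val2b,val3b\n"

# usecols holds the original labels, but they are looked up in names,
# so pandas reports them as missing.
try:
    pd.read_csv(
        io.StringIO(csv_text),
        usecols=["Col1", "Col3"],
        names=["first", "third"],
    )
    error = None
except ValueError as exc:
    error = str(exc)
print(error)

# When usecols refers to labels that exist in names, the check passes.
# header=0 tells pandas to discard the original header row.
df = pd.read_csv(
    io.StringIO(csv_text),
    header=0,
    names=["first", "second", "third"],
    usecols=["first", "third"],
)
print(df)
```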

Basically, pandas selects the usecols from the names; it does not apply the names as new labels to the columns selected by usecols. A workaround (which might not satisfy you) is to use integer indices:

df1d = pd.read_csv(
    folder_csv + "test_read_csv.csv",
    header=0,                 # at first line
    names=["first", "third"], # insert this header (=override)
    usecols=[0, 2],           # and select 1st/3rd columns
)

print(df1d)

#    first  third
# 0  val1a  val3a
# 1  val1b  val3b
# 2  val1c  val3c
# 3  val1d  val3d
# 4  val1e  val3e

10 Comments

This is certainly a possible workaround. Of course, in this case you need to know the column indices corresponding to the labels, but I don't think it is a big issue.
What I do not exactly understand, however, is why your example works: The way I'd understand it is that pandas first assigns new names to the columns and then selects the columns using usecols. However, doesn't it assign 'third' only to the second column so that the third one should still keep its original label 'Col3'?
When you pass a list of names that has fewer elements than the original header, as in pd.read_csv(folder_csv+'test_read_csv.csv', names=['first', 'third']) (i.e., 2 vs. 3), pandas assigns the names to the rightmost columns of the CSV (here the second and third columns).
To be precise: You mean that the last element of names is assigned to all columns beyond (i.e. to the right of) the corresponding column index (here Col2 and Col3), correct? As opposed to pandas assigning the given labels to the rightmost columns (one could interpret your reply as such).
Well, I think you did not :p. The original header is not a header anymore but the 1st row.
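
On the point raised in the comments about needing to know the column indices: one possible way around it is to read only the header row first and derive the indices from it. The sketch below is a hedged illustration, not part of pandas: read_csv_renamed is a hypothetical helper name, and the temporary file merely stands in for test_read_csv.csv. It relies on the integer-index workaround from the answer above.

```python
import os
import tempfile

import pandas as pd


def read_csv_renamed(path, rename):
    """Hypothetical helper: select columns by original name and rename them on read."""
    # Read only the header row to learn the original column order.
    header = pd.read_csv(path, nrows=0).columns
    # Pair each position with its new name and sort by position, because
    # the names are applied to the selected columns in file order.
    positions = sorted((header.get_loc(old), new) for old, new in rename.items())
    return pd.read_csv(
        path,
        header=0,  # discard the original header row
        usecols=[pos for pos, _ in positions],
        names=[new for _, new in positions],
    )


# Temporary stand-in for test_read_csv.csv (first two data rows only).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test_read_csv.csv")
    with open(path, "w") as f:
        f.write("Col1,Col2,Col3\nval1a,val2a,val3a\nval1b,val2b,val3b\n")
    df = read_csv_renamed(path, {"Col3": "third", "Col1": "first"})

print(df)
```

The file is opened twice, but the first pass parses no data rows, so the extra cost should be small. A name that is not in the header raises a KeyError from get_loc.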
1

As @Timeless stated, you can't do it directly from read_csv(), but you can use rename as follows:

>>> cols = {"Col1": "first", "Col3": "third"}
>>> pd.read_csv("test_read_csv.csv", usecols=cols.keys()).rename(columns=cols)
   first  third
0  val1a  val3a
1  val1b  val3b
2  val1c  val3c
3  val1d  val3d
4  val1e  val3e
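
A variant of the same idea, as a sketch: usecols also accepts a callable that is evaluated against each column name, so the rename mapping can drive the selection even when it lists columns that may be absent from the file. The in-memory CSV and the extra "Col9" entry are assumptions added for illustration only.

```python
import io

import pandas as pd

# In-memory stand-in for test_read_csv.csv (first two data rows only).
csv_text = "Col1,Col2,Col3\nval1a,val2a,val3a\nval1b,val2b,val3b\n"

# "Col9" is not in the file; it is included to show that a callable
# usecols skips absent columns instead of raising an error.
cols = {"Col1": "first", "Col3": "third", "Col9": "ninth"}

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=lambda name: name in cols,  # keep only columns listed in the mapping
).rename(columns=cols)

print(df)
```

The read_excel documentation describes a callable usecols as well, so the same pattern should carry over, although that is not tested here.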


1

According to the docs, I think this gives you what you want:

import pandas as pd

df = pd.read_csv("test.csv", usecols=[0,2], header=0, names=['first','third'])
print(df)

Output:

   first  third
0  val1a  val3a
1  val1b  val3b
2  val1c  val3c
3  val1d  val3d
4  val1e  val3e
