186

I've noticed three methods of selecting a column in a Pandas DataFrame:

First method of selecting a column using loc:

df_new = df.loc[:, 'col1']

Second method - seems simpler and faster:

df_new = df['col1']

Third method - most convenient:

df_new = df.col1

Is there a difference between these three methods? I don't think so, in which case I'd rather use the third method.

I'm mostly curious as to why there appear to be three methods for doing the same thing.

6
  • 2
    Or what about df.col1? All three of these are essentially equivalent for the very simple case of selecting a column. .loc will let you do much more than select a column. Possible duplicate of stackoverflow.com/questions/31593201/… Commented Jan 23, 2018 at 19:15
  • 1
    They do the same thing for simple slices. loc is more explicit, especially when your columns are numbers. Commented Jan 23, 2018 at 19:16
  • Thanks @juanpa.arrivillaga. Good point re: df.col1, which is yet another method of column selection. I've actually looked at that other question before, several times. It's great for explaining loc and iloc. However, this question is about the other method: "df['col1']". I'm just confused as to why there are two (or three) equivalent ways of doing what appears to be the same thing. Commented Jan 23, 2018 at 19:18
  • 2
    The big disadvantage of the 3rd method is that it's ambiguous when your column name is identical to an existing pandas attribute or method. E.g. you name a column 'sum'. Then if you type df.sum, what happens? (Spoiler alert: nothing useful, although df.sum() luckily still works.) So the 3rd way should be seen as a shortcut that is fine, but you need to be careful with it. Commented Jan 23, 2018 at 19:33
  • 1
    A decent explanation here stackoverflow.com/questions/38886080/… Commented Jan 23, 2018 at 19:34

5 Answers 5

190

In the following situations, they behave the same:

  1. Selecting a single column (df['A'] is the same as df.loc[:, 'A'] -> selects column A)
  2. Selecting a list of columns (df[['A', 'B', 'C']] is the same as df.loc[:, ['A', 'B', 'C']] -> selects columns A, B and C)
  3. Slicing by rows (df[1:3] is the same as df.iloc[1:3] -> selects rows 1 and 2. Note, however, if you slice rows with loc, instead of iloc, you'll get rows 1, 2 and 3 assuming you have a RangeIndex. See details here.)
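A minimal sketch of the three equivalences above, assuming a small made-up DataFrame with a default RangeIndex (the data and variable names are illustrative only):

```python
import pandas as pd

# Hypothetical example data with a default RangeIndex (labels 0..3).
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})

# 1. Single column: both spellings return the same Series.
same_column = df["A"].equals(df.loc[:, "A"])

# 2. List of columns: both spellings return the same DataFrame.
same_columns = df[["A", "B"]].equals(df.loc[:, ["A", "B"]])

# 3. Row slices: [] and iloc are positional and exclude the end point.
same_rows = df[1:3].equals(df.iloc[1:3])

# loc slices by label and includes the end point, so it returns one more row.
rows_brackets = len(df[1:3])
rows_loc = len(df.loc[1:3])
```

The last two lines show the caveat from point 3: the positional slice has two rows, while the label-based slice has three.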

However, [] cannot do the following, which .loc can:

  1. You can select a single row with df.loc[row_label]
  2. You can select a list of rows with df.loc[[row_label1, row_label2]]
  3. You can slice columns with df.loc[:, 'A':'C']

These three cannot be done with []. More importantly, if your selection involves both rows and columns, then assignment becomes problematic.
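As a rough illustration of those three .loc-only selections, assuming a made-up DataFrame with string row labels (all names here are hypothetical):

```python
import pandas as pd

# Hypothetical example data with string row labels.
df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
    index=["x", "y", "z"],
)

single_row = df.loc["x"]         # Series holding row "x"
row_list = df.loc[["x", "z"]]    # DataFrame with rows "x" and "z"
col_slice = df.loc[:, "A":"B"]   # columns "A" through "B", end point included

# df["x"] looks for a *column* named "x", so it raises KeyError here.
try:
    df["x"]
    bracket_found_row = True
except KeyError:
    bracket_found_row = False
```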

df[1:3]['A'] = 5

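This selects rows 1 and 2, then selects column 'A' of the returned object and assigns the value 5 to it. The problem is that the returned object might be a copy, so this may not change the actual DataFrame; pandas raises SettingWithCopyWarning for this reason. The correct way to make this assignment is a single .loc call. Because .loc slices by label and includes the end point, 1:2 covers rows 1 and 2 when the index is a default RangeIndex:

df.loc[1:2, 'A'] = 5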

With .loc, you are guaranteed to modify the original DataFrame. It also allows you to slice columns (df.loc[:, 'C':'F']), select a single row (df.loc[5]), and select a list of rows (df.loc[[1, 2, 5]]).
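A small sketch of the single-call assignment, assuming a default RangeIndex so that labels 1 through 2 are the rows at positions 1 and 2. The data is made up, and the iloc variant is shown only as a purely positional alternative:

```python
import pandas as pd

# Hypothetical example data with a default RangeIndex.
df = pd.DataFrame({"A": [0, 0, 0, 0], "B": [1, 1, 1, 1]})

# One indexing call on the original object:
# rows labelled 1 through 2 (end point included), column "A".
df.loc[1:2, "A"] = 5

# Positional equivalent: rows at positions 1 and 2 (end point excluded).
df2 = pd.DataFrame({"A": [0, 0, 0, 0], "B": [1, 1, 1, 1]})
df2.iloc[1:3, df2.columns.get_loc("A")] = 5
```

Both forms write through one indexer on the original object, so there is no intermediate result that could be a copy.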

Also note that these two were not included in the API at the same time. .loc was added much later as a more powerful and explicit indexer. See unutbu's answer for more detail.


Note: Getting columns with [] vs . is a completely different topic. . is only there for convenience. It only allows accessing columns whose names are valid Python identifiers (i.e. they cannot contain spaces, they cannot start with a digit...). It cannot be used when the names conflict with Series/DataFrame methods. It also cannot be used to create columns (i.e. the assignment df.a = 1 won't add a column if there is no column a; it only sets an attribute on the object). Other than that, . and [] are the same.
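The limits of attribute access can be sketched as follows; the column names are made up to trigger each case:

```python
import pandas as pd

# Hypothetical columns: one clashes with a method, one contains a space.
df = pd.DataFrame({"sum": [1, 2], "my col": [3, 4], "A": [5, 6]})

# "sum" collides with DataFrame.sum, so attribute access returns the method.
attr_is_method = callable(df.sum)
col_via_brackets = df["sum"]

# "my col" is not a valid identifier; only [] can reach it.
spaced = df["my col"]

# For an existing column with a valid, non-clashing name, both spellings agree.
same = df.A.equals(df["A"])

# Assigning to a new attribute sets a plain Python attribute, not a column.
df.new_col = 1
created_column = "new_col" in df.columns
```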


3 Comments

What do you mean with "the returning object might be a copy"? It is a bit confusing. Should I expect the value returned by df[1:3]['A'] = 5 to be a copy or not?
@AlessioF That's the problem. We don't really know. pandas makes no guarantees about what returns from df.__getitem__(...), and under the hood, the memory layout of the stored array can result in a view or a copy. In general, when you work on a dataframe with a single dtype, you get a view but that's not guaranteed. I believe they are working on a new approach instead of using BlockManager which is the main source of these issues.
And df.loc[] fails when the selection isn't found.
12

loc is especially useful when the index is not numeric (e.g. a DatetimeIndex) because you can get rows with particular labels from the index:

df.loc['2010-05-04 07:00:00']
df.loc['2010-1-1 0:00:00':'2010-12-31 23:59:59', 'Price']

However [] is intended to get columns with particular names:

df['Price']

With [] you can also filter rows, but it is more elaborate:

df[df['Date'] < datetime.datetime(2010,1,1,7,0,0)]['Price']
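For comparison, the same filter can be written as one .loc call that takes the row mask and the column label together. This is a sketch on made-up data; the 'Date' and 'Price' column names follow the snippet above, and the values and the cutoff variable are assumptions:

```python
import datetime

import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2009-12-31 23:00", "2010-01-01 06:00", "2010-01-01 08:00"]
    ),
    "Price": [10.0, 11.0, 12.0],
})
cutoff = datetime.datetime(2010, 1, 1, 7, 0, 0)

# Two steps: filter rows with [], then pick the column from the result.
chained = df[df["Date"] < cutoff]["Price"]

# One step: boolean row mask and column label in a single loc call.
single_call = df.loc[df["Date"] < cutoff, "Price"]
```

Both return the same Series when reading; the single-call form is the one that is also safe to assign through.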

2 Comments

I have df_augmented['date_of_birth'] = pd.to_datetime(df_augmented.date_of_birth, format='mixed') working, but not df_augmented.loc[:, 'date_of_birth'] = pd.to_datetime(df_augmented.date_of_birth, format='mixed') - the latter gives an object column, the former a datetime64[ns]; why? Are there differences between df['col'] and df.loc[:, 'col'] in this case?
Looks like it is a Pandas peculiarity after 2.0.x; found an answer here. Confused me quite a bit!
4

If you're unsure which of these approaches is recommended for your use case, take a look at these brief instructions from the pandas tutorial:

  • When selecting subsets of data, square brackets [] are used.

  • Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.

  • Select specific rows and/or columns using loc when using the row and column names

  • Select specific rows and/or columns using iloc when using the positions in the table

  • You can assign new values to a selection based on loc/iloc.

I highlighted some of the points to make their use-case differences even more clear.
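A short sketch of the loc/iloc split described in these points, using made-up data with string row labels. The last pair of lookups shows one possible way to combine a position with a name, by translating one side first:

```python
import pandas as pd

# Hypothetical example data with string row labels.
df = pd.DataFrame({"A": [10, 20, 30], "B": [40, 50, 60]}, index=["r1", "r2", "r3"])

by_label = df.loc["r2", "B"]    # row label "r2", column label "B"
by_position = df.iloc[1, 1]     # second row, second column

# Mixing a position with a name: convert one of them before indexing.
first_row_of_a = df.loc[df.index[0], "A"]
same_via_iloc = df.iloc[0, df.columns.get_loc("A")]

# Assignment works through either indexer.
df.loc["r3", "A"] = 99
df.iloc[0, 1] = -1
```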

1 Comment

What if you need to mix both methods, positions and names, for example the first row of the column 'A'?
0

There seems to be a difference between df.loc[] and df[] when you create multiple columns in a DataFrame.

You can refer to this question: Is there a nice way to generate multiple columns using .loc?

Here, you can't generate multiple columns using df.loc[:,['name1','name2']], but you can by just using double brackets: df[['name1','name2']]. (I wonder why they behave differently.)

1 Comment

Both methods worked when I tried, at least in the last Pandas version (1.5.3). see the screenshot
0

There is an important difference in assignment. These two assignments are not equivalent:

df['j'] = pd.Series(...)
df.loc[:,'j'] = pd.Series(...)
  • Indexing with [...] replaces the series. The new series comes with its own dtype, the existing dtype is discarded.

  • Indexing with loc[...] replaces values, but reuses the existing series and its dtype, upcasting may occur to fit new values.

See how the old int32 is ignored when using [...]:

import pandas as pd
import numpy.random as npr

n = 4

# Request int32 explicitly; NumPy's default integer dtype depends on the platform
df = pd.DataFrame({
    'j': npr.randint(1, 10, n, dtype='int32'),
    'k': npr.randint(1, 10, n, dtype='int32')})
print(df.dtypes)

# uint8 series
s = pd.Series(npr.randint(1, 10, n), dtype='uint8')

# Using [...]: uint8 series replaces int32 series
df2 = df.copy()
df2['j'] = s
print(df2.dtypes)

# Using loc[...]: uint8 data upcast to existing int32
df3 = df.copy()
df3.loc[:,'j'] = s
print(df3.dtypes)

j    int32        ⇠ original dtype
k    int32
dtype: object

j    uint8        ⇠ with [...]
k    int32
dtype: object

j    int32        ⇠ with loc[...]
k    int32
dtype: object
