How to create drop_duplicates in a SQL query? [duplicate]

Question

A common operation in pandas is dropping rows that are duplicated on a subset of columns, for example:

In [13]: import pandas as pd

In [14]: import io

In [15]: csv='''\
    ...: a,b
    ...: 1,2
    ...: 1,3
    ...: 2,3
    ...: 3,1
    ...: 3,3'''

In [16]: dt = pd.read_csv(io.StringIO(csv))

In [17]: dt
Out[17]:
   a  b
0  1  2
1  1  3
2  2  3
3  3  1
4  3  3

In [18]: dt.drop_duplicates(subset = ['a'])
Out[18]:
   a  b
0  1  2
2  2  3
3  3  1

How can this be done in SQL? Is there a standard function or approach that does what drop_duplicates(subset=<list>) does?

Edit

How the pandas duplicated function works:

In [20]: dt['a'].duplicated()
Out[20]:
0    False
1     True
2    False
3    False
4     True
Name: a, dtype: bool

In [21]: dt.drop_duplicates(subset=['a'])
Out[21]:
   a  b
0  1  2
2  2  3
3  3  1

@GordonLinoff Typically not really. I've added an example of how pandas selects rows, though.
– baxx, Commented Aug 9, 2020 at 16:59

Gordon Linoff · Accepted Answer · 2020-08-09 17:09:16Z

I think you want:

select a, b
from (select t.*, row_number() over (partition by a order by b) as seqnum
      from t
     ) t
where seqnum = 1;
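As a quick check, the query can be run against the sample data from the question. This is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table name t comes from the query above; the trailing order by a is added only to make the printed output deterministic. It assumes the bundled SQLite is version 3.25 or newer, since older versions lack window functions.

```python
import sqlite3

# In-memory database holding the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("create table t (a integer, b integer)")
conn.executemany(
    "insert into t (a, b) values (?, ?)",
    [(1, 2), (1, 3), (2, 3), (3, 1), (3, 3)],
)

# Keep one row per value of a: the one with the smallest b.
query = """
select a, b
from (select t.*, row_number() over (partition by a order by b) as seqnum
      from t
     ) t
where seqnum = 1
order by a
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(1, 2), (2, 3), (3, 1)]
conn.close()
```

For this data the result matches the pandas output in the question, but only because the first row for each a also happens to have the smallest b.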

Note that SQL tables represent unordered sets, unlike dataframes. There is no "first" row unless a column specifies the ordering; in the query above, order by b means the row with the smallest b is kept for each a.
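To mimic the pandas notion of "first" or "last" row, the table needs a column that records the original order. The sketch below assumes a hypothetical id column for that purpose (it is not part of the question's table), and it reorders the sample values so that the first row per a is not the one with the smallest b. The helper name keep_one is also just illustrative. It again uses Python's sqlite3 and assumes SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# id is a hypothetical column that records the original row order.
conn.execute("create table t (id integer primary key, a integer, b integer)")
conn.executemany(
    "insert into t (id, a, b) values (?, ?, ?)",
    [(0, 1, 3), (1, 1, 2), (2, 2, 3), (3, 3, 3), (4, 3, 1)],
)

def keep_one(direction):
    # direction is "asc" (keep first) or "desc" (keep last). It is interpolated
    # into the SQL text, so only these two fixed strings are accepted.
    if direction not in ("asc", "desc"):
        raise ValueError(direction)
    query = f"""
    select a, b
    from (select t.*,
                 row_number() over (partition by a order by id {direction}) as seqnum
          from t
         ) t
    where seqnum = 1
    order by a
    """
    return conn.execute(query).fetchall()

first_rows = keep_one("asc")
last_rows = keep_one("desc")
print(first_rows)  # [(1, 3), (2, 3), (3, 3)]
print(last_rows)   # [(1, 2), (2, 3), (3, 1)]
```

Ordering by id ascending roughly corresponds to drop_duplicates(subset=['a'], keep='first'), and descending to keep='last'.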

If you don't care which row is kept and only need one value of b per a, you can also use aggregation:

select a, min(b) as b
from t
group by a;
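One difference between the two approaches shows up once the table has more columns: separate aggregates can combine values from different rows, while the row_number() query always returns a row that exists. The sketch below illustrates this with a hypothetical third column c (not in the question's data), using Python's sqlite3 and assuming SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# c is a hypothetical third column added for illustration.
conn.execute("create table t (a integer, b integer, c integer)")
conn.executemany(
    "insert into t (a, b, c) values (?, ?, ?)",
    [(1, 2, 9), (1, 3, 5), (2, 3, 7)],
)

# Each aggregate is computed independently within the group.
agg_rows = conn.execute(
    "select a, min(b) as b, min(c) as c from t group by a order by a"
).fetchall()

# The window-function version picks one whole row per group.
window_rows = conn.execute(
    """
    select a, b, c
    from (select t.*, row_number() over (partition by a order by b) as seqnum
          from t
         ) t
    where seqnum = 1
    order by a
    """
).fetchall()

print(agg_rows)     # [(1, 2, 5), (2, 3, 7)]
print(window_rows)  # [(1, 2, 9), (2, 3, 7)]
conn.close()
```

Here (1, 2, 5) from the aggregation is not a row of the table, whereas (1, 2, 9) is. With only the columns a and b, as in the question, both queries give the same result.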

edited Aug 9, 2020 at 17:09

answered Aug 9, 2020 at 16:57
