1

A common operation in pandas is something such as

In [14]: import io

In [15]: csv='''\
    ...: a,b
    ...: 1,2
    ...: 1,3
    ...: 2,3
    ...: 3,1
    ...: 3,3'''

In [16]: dt = pd.read_csv(io.StringIO(csv))

In [17]: dt
Out[17]:
   a  b
0  1  2
1  1  3
2  2  3
3  3  1
4  3  3

In [18]: dt.drop_duplicates(subset = ['a'])
Out[18]:
   a  b
0  1  2
2  2  3
3  3  1

How can this be performed in SQL though? Is there either a standard function or approach to doing what drop_duplicates(subset = <list>) does?

Edit

How pandas duplicate function works:

In [20]: dt['a'].duplicated()
Out[20]:
0    False
1     True
2    False
3    False
4     True
Name: a, dtype: bool

In [21]: dt.drop_duplicates(subset=['a'])
Out[21]:
   a  b
0  1  2
2  2  3
3  3  1
2
  • Do you care what rows remain? Commented Aug 9, 2020 at 16:57
  • @GordonLinoff typically not really - I've added in an example of how pandas select things though Commented Aug 9, 2020 at 16:59

1 Answer 1

2

I think you want:

select a, b
from (select t.*, row_number() over (partition by a order by b) as seqnum
      from t
     ) t
where seqnum = 1;

Note that SQL tables represent unordered sets, unlike dataframes. There is no "first" row unless a column specifies the ordering.

If you don't care about the rows, you can also use aggregation:

select a, min(b) as b
from t
group by a;
Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.