I have a dataframe that has many rows per combination of the 'PROGRAM', 'VERSION', and 'RELEASE_DATE' columns. I want to get a dataframe with all of the combinations of just those three columns.

Would this be a job for groupby or distinct?

2 Answers

Since you are not aggregating anything, use unique() rather than a group-by:

df.select('PROGRAM','VERSION','RELEASE_DATE').unique()

3 Comments

Can the select version posted above be iterated on?
for prog, vers, rel in df.select(['PROGRAM','VERSION','RELEASE_DATE']).distinct().rows(): ...
distinct() has been deprecated; use unique() instead.
pl.DataFrame.unique has a subset parameter to specify the columns to consider when identifying duplicate rows.

df.unique(subset=["PROGRAM", "VERSION", "RELEASE_DATE"])
