I have a dataset which looks like this:

   val
   1
   1
   3
   4
   6
   6
   9
   ...

I can't load it into a pandas DataFrame because of its huge size, so I aggregate the data using Spark to form:

   val   occurrences
   1     2
   3     1
   4     1
   6     2
   9     1
   ...

and load that into a pandas DataFrame. The "val" column never exceeds 100, so the aggregated table doesn't take much memory.

My problem is that I can't easily operate on such a structure, e.g. find the mean or median using pandas, or plot a boxplot with seaborn. I can only do it using explicit formulas I write myself, not ready-made built-in methods. Is there a pandas structure, or any other way, to cope with such data?

For example:

1,1,3,4,6,6,9

would be:

df = pd.DataFrame({'val': [1,3,4,6,9], "occurrences" : [2,1,1,2,1]})

The median is 4. I'm looking for a method to extract the median directly from the given df.

  • df.val.value_counts().reset_index() Commented Sep 18, 2018 at 15:45
  • Could you please elaborate on what your input and output dataframes are? Additionally, you can refer to pandas.DataFrame.mean and pandas.DataFrame.boxplot. Commented Sep 18, 2018 at 16:32

1 Answer

No, pandas does not operate on such objects the way you would expect. Judging by other answers on Stack Overflow, even computing a median for that table structure takes at least a few lines of code.
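
To give an idea of what "a few lines" might look like, here is a minimal sketch using the example df from the question. It assumes the counts are non-negative integers; the variable names and the cumulative-count lookup are my own choices, not something pandas ships for this table layout.

```python
import numpy as np
import pandas as pd

# Example table from the question: distinct values and how often each occurs.
df = pd.DataFrame({"val": [1, 3, 4, 6, 9], "occurrences": [2, 1, 1, 2, 1]})

# Weighted mean: each value contributes `occurrences` times.
mean = np.average(df["val"], weights=df["occurrences"])

# Median: sort by value and build cumulative counts, then look up which
# value covers the middle position(s) of the (never materialised) full data.
d = df.sort_values("val")
vals = d["val"].to_numpy()
cum = d["occurrences"].to_numpy().cumsum()
n = cum[-1]

# 0-based middle positions; they coincide when n is odd.
lo, hi = (n - 1) // 2, n // 2
v_lo = vals[np.searchsorted(cum, lo, side="right")]
v_hi = vals[np.searchsorted(cum, hi, side="right")]
median = (v_lo + v_hi) / 2
```

For the example this gives a median of 4, matching the value stated in the question. Expanding the table back with np.repeat would also work, but it rebuilds the full dataset in memory, which is what the aggregation was meant to avoid.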

If you wanted to make your own seaborn hooks/wrappers, a good place to start would probably be an efficient percentiles(df, p) method. The median is then just percentiles(df, [50]). A box plot would just be percentiles(df, [0, 25, 50, 75, 100]), and so on. Your development time could then be fairly minimal (depending on how complicated the statistics you need are).
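
One possible shape for such a percentiles(df, p) helper is sketched below. The column-name parameters (val_col, count_col) are hypothetical, and the linear interpolation between neighbouring ranks is an assumption chosen to mirror the default behaviour of np.percentile; other percentile definitions would need a different lookup.

```python
import numpy as np
import pandas as pd


def percentiles(df, p, val_col="val", count_col="occurrences"):
    """Percentiles of the data described by a (value, count) table.

    Sketch only: assumes non-negative integer counts and uses linear
    interpolation between the two nearest ranks.
    """
    d = df.sort_values(val_col)
    vals = d[val_col].to_numpy(dtype=float)
    cum = d[count_col].to_numpy().cumsum()
    n = cum[-1]  # total number of observations

    # Fractional 0-based position of each requested percentile.
    pos = np.asarray(p, dtype=float) / 100.0 * (n - 1)
    lo = np.floor(pos).astype(np.int64)
    hi = np.ceil(pos).astype(np.int64)

    # Map each position to the value whose cumulative count covers it.
    v_lo = vals[np.searchsorted(cum, lo, side="right")]
    v_hi = vals[np.searchsorted(cum, hi, side="right")]
    return v_lo + (v_hi - v_lo) * (pos - lo)


df = pd.DataFrame({"val": [1, 3, 4, 6, 9], "occurrences": [2, 1, 1, 2, 1]})
median = percentiles(df, [50])[0]
box_stats = percentiles(df, [0, 25, 50, 75, 100])
```

The five numbers in box_stats could then be handed to a plotting routine that accepts precomputed statistics (matplotlib's Axes.bxp is one candidate), so the full dataset never has to be loaded.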
