I have a pandas dataframe with a variable number of columns. I'd like to numerically integrate each column so that I can evaluate the definite integral from row 0 to row 'n'. I have a function that works on a 1D array, but is there a better way to do this on a pandas dataframe, so that I don't have to iterate over columns and cells? I was thinking of using applymap somehow, but I can't see how to make it work.

This is the function that works on a 1D array:

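    import numpy as np

    def findB(x, y):
        # Cumulative trapezoidal integral of y over x for 1D arrays.
        y_int = np.zeros(y.size)
        end = y.size - 1

        # First trapezoid, between samples 0 and 1.
        y_int[0] = (y[1] + y[0]) / 2 * (x[1] - x[0])

        # Add each following trapezoid to the running total.
        # The last element, y_int[end], is left at zero.
        for i in range(1, end):
            j = i + 1
            y_int[i] = (y[j] + y[i]) / 2 * (x[j] - x[i]) + y_int[i-1]

        return y_int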

I'd like to replace it with something that calculates multiple columns of a dataframe all at once, something like this:

    B_df = y_df.applymap(integrator)  

EDIT:

Starting dataframe dB_df:

        Sample1 1 dB    Sample1 2 dB    Sample1 3 dB    Sample1 4 dB    Sample1 5 dB    Sample1 6 dB
    0   2.472389    6.524537    0.306852    -6.209527   -6.531123   -4.901795
    1   6.982619    -0.534953   -7.537024   8.301643    7.744730    7.962163
    2   -8.038405   -8.888681   6.856490    -0.052084   0.018511    -4.117407
    3   0.040788    5.622489    3.522841    -8.170495   -7.707704   -6.313693
    4   8.512173    1.896649    -8.831261   6.889746    6.960343    8.236696
    5   -6.234313   -9.908385   4.934738    1.595130    3.116842    -2.078000
    6   -1.998620   3.818398    5.444592    -7.503763   -8.727408   -8.117782
    7   7.884663    3.818398    -8.046873   6.223019    4.646397    6.667921
    8   -5.332267   -9.163214   1.993285    2.144201    4.646397    0.000627
    9   -2.783008   2.288842    5.836786    -8.013618   -7.825365   -8.470759

Ending dataframe B_df:

        Sample1 1 B    Sample1 2 B    Sample1 3 B    Sample1 4 B    Sample1 5 B    Sample1 6 B
    0   0.000038    0.000024    -0.000029   0.000008    0.000005    0.000012
    1   0.000034    -0.000014   -0.000032   0.000041    0.000036    0.000028
    2   0.000002    -0.000027   0.000010    0.000008    0.000005    -0.000014
    3   0.000036    0.000003    -0.000011   0.000003    0.000002    -0.000006
    4   0.000045    -0.000029   -0.000027   0.000037    0.000042    0.000018
    5   0.000012    -0.000053   0.000015    0.000014    0.000020    -0.000023
    6   0.000036    -0.000023   0.000004    0.000009    0.000004    -0.000028
    7   0.000046    -0.000044   -0.000020   0.000042    0.000041    -0.000002
    8   0.000013    -0.000071   0.000011    0.000019    0.000028    -0.000036
    9   0.000000    0.000000    0.000000    0.000000    0.000000    0.000000

In the above example,

    (x[j]-x[i]) = 0.000008
  • Can you give an example of your input DataFrame and your expected output? Commented May 10, 2017 at 19:04
  • You are probably looking for apply, but it really won't be any more efficient than a loop over the columns. Commented May 10, 2017 at 19:06
  • Where is x coming from? Is it a Series, a numpy ndarray, or something else? Commented May 10, 2017 at 19:50
  • x comes from another array, but ultimately (x[j]-x[i]) is a constant value of 0.000008 for all i and j. @Mad Physicist Commented May 10, 2017 at 19:57
  • What is the type of x? That is much more important than the numerical value. Commented May 10, 2017 at 19:57

2 Answers

First of all, you can achieve a similar result using vectorized operations. Each element of the integration is just the mean of the current and next y values, scaled by the corresponding difference in x. The final integral is the cumulative sum of these elements. You can do that with something like:

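    def findB(x, y):
        """
        x : pandas.Series
        y : pandas.DataFrame
        """
        # Mean of each y value and the next one, for every column.
        mean_y = (y[:-1] + y.shift(-1)[:-1]) / 2
        # Width of each interval in x.
        delta_x = x.shift(-1)[:-1] - x[:-1]
        # axis='index' matches delta_x against the rows, not the column labels.
        scaled_int = mean_y.multiply(delta_x, axis='index')
        cumulative_int = scaled_int.cumsum(axis='index')
        return cumulative_int.shift(1).fillna(0)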

Here DataFrame.shift and Series.shift are used to line up each "next" element with the current one. Use DataFrame.multiply rather than the * operator so that you can name the axis explicitly: pass axis='index' so that delta_x is matched against the rows, because the default ('columns') would match it against the column labels. DataFrame.cumsum then provides the final integration step, and DataFrame.fillna replaces the NaN that the last shift introduces in the first row with zero. The advantage of using native pandas functions throughout is that you can pass in a dataframe with any number of columns and have it operate on all of them simultaneously.
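As a cross-check on the idea above, here is a minimal sketch of the same cumulative trapezoid rule written with NumPy broadcasting and wrapped back into a DataFrame. The function name `cumulative_trapezoid_df` and the demo data are hypothetical, chosen only for illustration. The sketch also assumes a convention in which row k holds the integral from row 0 to row k, so row 0 is zero and no interval is dropped. This differs from the indexing of the question's findB.

```python
import numpy as np
import pandas as pd


def cumulative_trapezoid_df(x, y_df):
    """Cumulative trapezoidal integral of every column of y_df over x.

    x    : 1D array-like of sample positions, one per row of y_df
    y_df : pandas.DataFrame, one signal per column
    Row k of the result holds the integral from row 0 to row k.
    """
    x = np.asarray(x, dtype=float)
    y = y_df.to_numpy(dtype=float)
    dx = np.diff(x)[:, None]              # shape (n-1, 1), broadcasts over columns
    areas = (y[1:] + y[:-1]) / 2 * dx     # one trapezoid per interval and column
    out = np.zeros_like(y)                # row 0 stays at zero
    out[1:] = np.cumsum(areas, axis=0)    # running total down each column
    return pd.DataFrame(out, index=y_df.index, columns=y_df.columns)


# Hypothetical demo data with the constant spacing mentioned in the question.
x = np.arange(5) * 0.000008
demo = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0, 4.0],
                     "b": [1.0, 1.0, 1.0, 1.0, 1.0]})
result = cumulative_trapezoid_df(x, demo)
print(result)
```

Because the work is done on the underlying array, the number of columns does not matter, and the index and column labels of the input are carried over to the result unchanged.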


Do you really need numeric values of the integral? Maybe you just need a plot? If so, it is easier to use pyplot.

    import matplotlib.pyplot as plt
    import pandas as pd

    # df is assumed to be an existing DataFrame with 'volume2' and 'universe' columns.
    # Introduce a column *bin* holding the left limits of our bins.
    df['bin'] = pd.cut(df['volume2'], 50).apply(lambda interval: interval.left)
    # Group by bins and calculate *f* as the per-bin sum.
    g = df[['bin', 'universe']].groupby('bin').sum()
    # Plot the function using cumulative=True.
    plt.hist(list(g.index), bins=50, weights=list(g['universe']), cumulative=True)
    plt.show()
