I am trying to minimize a function I have written using SciPy's scipy.optimize.minimize. However, the objective function is itself built from helper functions that perform calculations on a pandas DataFrame.

I understand that SciPy's minimize function accepts additional arguments via the args tuple (see, e.g., "Structure of inputs to scipy minimize function"). However, I do not know how to pass in a function that relies on a pandas DataFrame.

I have created a reproducible example below.

import pandas as pd
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize


####################     Data     ####################
# Initialize dataframe. 
data = pd.DataFrame({'id_i': ['AA', 'BB', 'CC', 'XX', 'DD'], 
                     'id_j': ['ZZ', 'YY', 'XX', 'BB', 'AA'], 
                     'y': [0.30, 0.60, 0.70, 0.45, 0.65], 
                     'num': [1000, 2000, 1500, 1200, 1700], 
                     'bar': [-4.0, -6.5, 1.0, -3.0, -5.5], 
                     'mu': [-4.261140, -5.929608, 1.546283, -1.810941, -3.186412]})

data['foo_1'] = data['bar'] - 11 * norm.ppf(1/1.9)
data['foo_2'] = data['bar'] - 11 * norm.ppf(1 - (1/1.9))

# Store list of ids.
id_list = sorted(pd.unique(pd.concat([data['id_i'], data['id_j']], axis=0)))


####################     Functions     ####################
# Function 1: Intermediate calculation to calculate predicted values.
def calculate_y_pred(row, delta_params, sigma_param, id_list):

    # Extract the relevant values from delta_params.
    delta_i = delta_params[id_list.index(row['id_i'])]
    delta_j = delta_params[id_list.index(row['id_j'])]

    # Calculate adjusted version of mu. 
    mu_adj = row['mu'] - delta_i + delta_j

    # Calculate predicted value of y.
    y_pred = norm.cdf(row['foo_1'], loc=mu_adj, scale=sigma_param) / \
                (norm.cdf(row['foo_1'], loc=mu_adj, scale=sigma_param) + 
                    (1 - norm.cdf(row['foo_2'], loc=mu_adj, scale=sigma_param)))

    return y_pred

# Function to calculate the log-likelihood (for a row of DataFrame data).
def loglik_row(row, delta_params, sigma_param, id_list):

    # Calculate the log-likelihood for this row.
    y_pred = calculate_y_pred(row, delta_params, sigma_param, id_list)
    y_obs = row['y']
    n = row['num']
    loglik_row = np.log(norm.pdf(((y_obs - y_pred) * np.sqrt(n)) / np.sqrt(y_pred * (1-y_pred))) / 
                            np.sqrt(y_pred * (1-y_pred) / n))

    return loglik_row

# Function to calculate the sum of the negative log-likelihood. 
# This function is called via SciPy's minimize function.
def loglik_total(data, id_list, params):

    # Extract parameters.
    delta_params = list(params[0:len(id_list)])
    sigma_param = init_params[-1]

    # Calculate the negative log-likelihood for every row in data and sum the values.
    loglik_total = -np.sum( data.apply(lambda row: loglik_row(row, delta_params, sigma_param, id_list), axis=1) )

    return loglik_total


####################     Optimize     ####################
# Provide initial parameter guesses. 
delta_params = [0 for id in id_list]
sigma_param = 11
init_params = tuple(delta_params + [sigma_param])

# Maximize the log likelihood (minimize the negative log likelihood). 
minimize(fun=loglik_total, x0=init_params, 
            args=(data, id_list), method='nelder-mead')

This results in the following error: AttributeError: 'numpy.ndarray' object has no attribute 'apply' (the entire error output is below). I believe this error occurs because minimize is treating data as a numpy array, whereas I would like to pass it as a pandas DataFrame.

AttributeError: 'numpy.ndarray' object has no attribute 'apply'
AttributeErrorTraceback (most recent call last)
<ipython-input-93-9a5866bd626e> in <module>()
      1 minimize(fun=loglik_total, x0=init_params, 
----> 2             args=(data, id_list), method='nelder-mead')
/Users/adam/anaconda/lib/python2.7/site-packages/scipy/optimize/_minimize.pyc in minimize(fun, x0, args, method, jac, hess, hessp, bounds, constraints, tol, callback, options)
    436                       callback=callback, **options)
    437     elif meth == 'nelder-mead':
--> 438         return _minimize_neldermead(fun, x0, args, callback, **options)
    439     elif meth == 'powell':
    440         return _minimize_powell(fun, x0, args, callback, **options)
/Users/adam/anaconda/lib/python2.7/site-packages/scipy/optimize/optimize.pyc in _minimize_neldermead(func, x0, args, callback, maxiter, maxfev, disp, return_all, initial_simplex, xatol, fatol, **unknown_options)
    515 
    516     for k in range(N + 1):
--> 517         fsim[k] = func(sim[k])
    518 
    519     ind = numpy.argsort(fsim)
/Users/adam/anaconda/lib/python2.7/site-packages/scipy/optimize/optimize.pyc in function_wrapper(*wrapper_args)
    290     def function_wrapper(*wrapper_args):
    291         ncalls[0] += 1
--> 292         return function(*(wrapper_args + args))
    293 
    294     return ncalls, function_wrapper
<ipython-input-69-546e169fc54e> in loglik_total(data, id_list, params)
      6 
      7     # Calculate the negative log-likelihood for every row in data and sum the values.
----> 8     loglik_total = -np.sum( data.apply(lambda row: loglik_row(row, delta_params, sigma_param, id_list), axis=1) )
      9 
     10     return loglik_total
AttributeError: 'numpy.ndarray' object has no attribute 'apply'

What would be the proper way to handle the DataFrame data and call my function loglik_total within SciPy's minimize function? Any suggestions are welcome and would be appreciated.

Possible Solution: I have considered editing my functions to treat data as a numpy array rather than a pandas DataFrame. However, I would like to avoid this if possible for a couple of reasons: 1) in loglik_total, I use pandas' apply function to apply the loglik_row function to every row of data; 2) it is convenient to refer to columns of data by their column names rather than by numerical indices.

  • Cannot reproduce the error; I receive KeyError: ('id_i', u'occurred at index 0') Commented Apr 5, 2017 at 19:21
  • @Cleb I apologize. You got that error because I had accidentally included an extra line data = pd.DataFrame(data) in the loglik_total function (I included that while exploring options of explicitly converting data from a numpy array to a pandas DataFrame). I have removed that line and you should now be able to reproduce the error displayed in the original post. Commented Apr 5, 2017 at 20:15
  • Ok, I think I found the issue; please check the answer below. Commented Apr 5, 2017 at 20:44

1 Answer

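The problem is not the data format but the signature of loglik_total. minimize calls the objective as fun(x, *args), so the parameter vector has to be the first argument, followed by the additional arguments in the same order as in args of your minimize call. With your signature loglik_total(data, id_list, params), the parameter array was bound to data, which is why apply was looked up on a numpy.ndarray. Here is the modified version with the correct order of arguments (it also reads sigma_param from params instead of the global init_params, and stores the result in lt so the local variable no longer shadows the function name):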

def loglik_total(params, data, id_list):

    # Extract parameters.
    delta_params = list(params[0:len(id_list)])
    sigma_param = params[-1]

    # Calculate the negative log-likelihood for every row in data and sum the values.
    lt = -np.sum( data.apply(lambda row: loglik_row(row, delta_params, sigma_param, id_list), axis=1) )

    return lt

If you then call

res = minimize(fun=loglik_total, x0=init_params,
            args=(data, id_list), method='nelder-mead')

it runs without error (the effective call order is x, data, id_list, which now matches the signature of loglik_total), and res looks as follows:

final_simplex: (array([[  2.55758096e+05,   6.99890451e+04,  -1.41860117e+05,
          3.88586258e+05,   3.19488400e+05,   4.90209168e+04,
          6.43380010e+04,  -1.85436851e+09],
       [  2.55758096e+05,   6.99890451e+04,  -1.41860117e+05,
          3.88586258e+05,   3.19488400e+05,   4.90209168e+04,
          6.43380010e+04,  -1.85436851e+09],
       [  2.55758096e+05,   6.99890451e+04,  -1.41860117e+05,
          3.88586258e+05,   3.19488400e+05,   4.90209168e+04,
          6.43380010e+04,  -1.85436851e+09],
       [  2.55758096e+05,   6.99890451e+04,  -1.41860117e+05,
          3.88586258e+05,   3.19488400e+05,   4.90209168e+04,
          6.43380010e+04,  -1.85436851e+09],
       [  2.55758096e+05,   6.99890451e+04,  -1.41860117e+05,
          3.88586258e+05,   3.19488400e+05,   4.90209168e+04,
          6.43380010e+04,  -1.85436851e+09],
       [  2.55758096e+05,   6.99890451e+04,  -1.41860117e+05,
          3.88586258e+05,   3.19488400e+05,   4.90209168e+04,
          6.43380010e+04,  -1.85436851e+09],
       [  2.55758096e+05,   6.99890451e+04,  -1.41860117e+05,
          3.88586258e+05,   3.19488400e+05,   4.90209168e+04,
          6.43380010e+04,  -1.85436851e+09],
       [  2.55758096e+05,   6.99890451e+04,  -1.41860117e+05,
          3.88586258e+05,   3.19488400e+05,   4.90209168e+04,
          6.43380010e+04,  -1.85436851e+09],
       [  2.55758096e+05,   6.99890451e+04,  -1.41860117e+05,
          3.88586258e+05,   3.19488400e+05,   4.90209168e+04,
          6.43380010e+04,  -1.85436851e+09]]), array([-0., -0., -0., -0., -0., -0., -0., -0., -0.]))
           fun: -0.0
       message: 'Optimization terminated successfully.'
          nfev: 930
           nit: 377
        status: 0
       success: True
             x: array([  2.55758096e+05,   6.99890451e+04,  -1.41860117e+05,
         3.88586258e+05,   3.19488400e+05,   4.90209168e+04,
         6.43380010e+04,  -1.85436851e+09])

Whether this output makes any sense, I cannot judge though :)
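
To make the calling convention concrete, here is a minimal sketch on hypothetical toy data (the DataFrame toy, the function sse, and the column names are invented for illustration and have nothing to do with the model in the question). It records the types the objective actually receives, and it also shows one possible way to keep a "data first" signature by binding the leading arguments with functools.partial:

```python
import numpy as np
import pandas as pd
from functools import partial
from scipy.optimize import minimize

# Hypothetical toy data: y = 2*x + 1 exactly.
toy = pd.DataFrame({'x': [0.0, 1.0, 2.0, 3.0],
                    'y': [1.0, 3.0, 5.0, 7.0]})

seen_types = set()  # records (type of params, type of df) on every call


def sse(params, df, target_col):
    """Sum of squared errors; the parameter vector comes first."""
    seen_types.add((type(params).__name__, type(df).__name__))
    slope, intercept = params
    resid = df[target_col] - (slope * df['x'] + intercept)
    return float((resid ** 2).sum())


# minimize calls sse(x, *args), i.e. sse(x, toy, 'y').
res = minimize(sse, x0=[0.0, 0.0], args=(toy, 'y'), method='nelder-mead')


# Alternative: keep a "data first" signature and bind the leading
# positional arguments, so minimize only has to supply params.
def sse_data_first(df, target_col, params):
    return sse(params, df, target_col)


res_partial = minimize(partial(sse_data_first, toy, 'y'),
                       x0=[0.0, 0.0], method='nelder-mead')

print(seen_types)          # the DataFrame is passed through unchanged
print(res.x, res_partial.x)
```

Only the first argument is converted to a numpy array by minimize; everything in args is handed to the objective as is, so the DataFrame keeps its column names and methods such as apply.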
