
I have a dataset whose sample structure looks like this:

SV,Arizona,618,264,63,923
SV,Arizona,367,268,94,138
SV,Arizona,421,268,121,178
SV,Arizona,467,268,171,250
SV,Arizona,298,270,62,924
SV,Arizona,251,272,93,138
SV,Arizona,215,276,120,178
SV,Arizona,222,279,169,250
SV,Arizona,246,279,64,94
SV,Arizona,181,281,97,141
SV,Arizona,197,286,125.01,182
SV,Arizona,178,288,175.94,256
SV,California,492,208,63,923
SV,California,333,210,94,138
SV,California,361,213,121,178
SV,California,435,217,171,250
SV,California,222,215,62,92
SV,California,177,218,93,138
SV,California,177,222,120,178
SV,California,156,228,169,250
SV,California,239,225,64,94
SV,California,139,229,97,141
SV,California,198,234,125,182

The records are in the order company_id, state, profit, feature1, feature2, feature3.

Now I wrote this code, which breaks the whole dataset into chunks of 12 records (for each company, and for each state within that company, there are 12 records) and then passes each chunk to the process_chunk() function. Inside process_chunk() the records in the chunk are split into a test set and a training set, with records 10 and 11 going into the test set and the rest into the training set. I also store the company_id and state of the test-set records in global lists for later display of the predicted values, and I append the predicted values to a global list final_prediction.

Now the issue I am facing is that the company_list, state_list and test-set lists all have the same size (about 200 records), but final_prediction is half that size (about 100 records). If the test set has 200 records, shouldn't final_prediction also have size 200? My current code is:

from sklearn import linear_model
import numpy as np
import csv

final_prediction = []
company_list = []
state_list = []

def process_chunk(chunk):

    training_set_feature_list = []
    training_set_label_list = []
    test_set_feature_list = []
    test_set_label_list = []
    np.set_printoptions(suppress=True)

    prediction_list = []


    # to divide into training & test, I put records 10 and 11 (count 9 and 10) in the test set
    count = 0
    for line in chunk:
        # Converting strings to numpy arrays
        if count == 9:   
            test_set_feature_list.append(np.array(line[3:4],dtype = np.float))
            test_set_label_list.append(np.array(line[2],dtype = np.float))
            company_list.append(line[0])
            state_list.append(line[1])

        elif count == 10:
            test_set_feature_list.append(np.array(line[3:4],dtype = np.float))
            test_set_label_list.append(np.array(line[2],dtype = np.float))
            company_list.append(line[0])
            state_list.append(line[1])

        else:    
            training_set_feature_list.append(np.array(line[3:4],dtype = np.float))
            training_set_label_list.append(np.array(line[2],dtype = np.float))
        count += 1
    # Create linear regression object
    regr = linear_model.LinearRegression()
    # Train the model using the training sets
    regr.fit(training_set_feature_list, training_set_label_list)



    prediction_list.append(regr.predict(test_set_feature_list))
    np.set_printoptions(formatter={'float_kind':'{:f}'.format})
    for items in prediction_list:
        final_prediction.append(items)




# Load and parse the data
file_read = open('data.csv', 'r')

reader = csv.reader(file_read)

chunk, chunksize = [], 12

for i, line in enumerate(reader):
    if (i % chunksize == 0 and i > 0):
        process_chunk(chunk)
        del chunk[:]
    chunk.append(line)

# process the remainder

#process_chunk(chunk)


print len(company_list)
print len(test_set_feature_list)
print len(final_prediction)

Why does this difference in size occur, and what mistake am I making in my code that I can rectify (maybe something I am doing very naively that can be done in a better way)?

  • Why don't you use pandas? It has chunking support in the csv reader. – Commented Sep 19, 2015 at 17:18
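For reference, here is a minimal sketch of the pandas approach the comment suggests; pandas.read_csv with a chunksize argument yields one DataFrame per chunk, and the column names below are assumptions based on the field order described in the question:

import pandas as pd

cols = ['company_id', 'state', 'profit', 'feature1', 'feature2', 'feature3']

# chunksize makes read_csv return an iterator of 12-row DataFrames,
# which replaces the manual enumerate/modulo chunking loop and never
# skips the final chunk
for chunk in pd.read_csv('data.csv', names=cols, chunksize=12):
    process_chunk(chunk.values.tolist())  # reuse the existing list-based function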

1 Answer


Here:

prediction_list.append(regr.predict(test_set_feature_list))
np.set_printoptions(formatter={'float_kind':'{:f}'.format})
for items in prediction_list:
    final_prediction.append(items)

prediction_list will be a list of arrays (since predict returns an array).

So you'll be appending whole arrays to your final_prediction, which is what messes up your count: len(final_prediction) ends up equal to the number of chunks, not the number of test records.

Within each chunk the prediction itself is fine: regr.predict returns one value per row of test_set_feature_list. The problem is only how you accumulate those values into final_prediction.

You probably want to use extend instead:

final_prediction.extend(regr.predict(test_set_feature_list))

which is also easier to read.

Then the length of final_prediction should be fine, and it should be a single list, rather than a list of lists.
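To see the append/extend difference concretely, here is a small standalone sketch (the numbers are made up; the shapes mirror the question's two test rows per chunk):

import numpy as np

appended, extended = [], []
for chunk_prediction in [np.array([1.0, 2.0]), np.array([3.0, 4.0])]:
    appended.append(chunk_prediction)  # adds the whole array as one item
    extended.extend(chunk_prediction)  # adds each prediction individually

print(len(appended))  # 2 -> number of chunks
print(len(extended))  # 4 -> number of test records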


3 Comments

My mistake, it was a typo which I have corrected. The test_set_feature_list has the feature records on which the prediction is done once the model is built.
Also, I am appending the prediction results to final_prediction because process_chunk() is called for every 12 records in the dataset. If I had done final_prediction = regr.predict(test_set_feature_list), wouldn't final_prediction contain only the prediction results from the last run of process_chunk()?
When I print my final_prediction list I see output like [array([93495052.969556, 98555123.061462]), array([1000976814.605984, 998276347.359732]), ...]. What do the two values in each array() mean?
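The two values follow from the question's own setup: each chunk puts two records (count 9 and 10) into the test set, so regr.predict returns one two-element array per chunk. A minimal sketch with toy numbers:

from sklearn import linear_model

# fit a toy single-feature model, then predict for two test rows at once,
# mirroring the two test records taken from each 12-record chunk
regr = linear_model.LinearRegression()
regr.fit([[1.0], [2.0], [3.0]], [10.0, 20.0, 30.0])
print(regr.predict([[4.0], [5.0]]))  # [40. 50.] -- one prediction per test row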
