I have a dataset the sample structure of which looks like this:
SV,Arizona,618,264,63,923
SV,Arizona,367,268,94,138
SV,Arizona,421,268,121,178
SV,Arizona,467,268,171,250
SV,Arizona,298,270,62,924
SV,Arizona,251,272,93,138
SV,Arizona,215,276,120,178
SV,Arizona,222,279,169,250
SV,Arizona,246,279,64,94
SV,Arizona,181,281,97,141
SV,Arizona,197,286,125.01,182
SV,Arizona,178,288,175.94,256
SV,California,492,208,63,923
SV,California,333,210,94,138
SV,California,361,213,121,178
SV,California,435,217,171,250
SV,California,222,215,62,92
SV,California,177,218,93,138
SV,California,177,222,120,178
SV,California,156,228,169,250
SV,California,239,225,64,94
SV,California,139,229,97,141
SV,California,198,234,125,182
The fields in each record are, in order: company_id, state, profit, feature1, feature2, feature3.
I wrote code that breaks the whole dataset into chunks of 12 records (for each company, and for each state within that company, there are 12 records) and passes each chunk to the process_chunk() function. Inside process_chunk() the records in the chunk are split into a test set and a training set, with records 10 and 11 going into the test set and the rest into the training set (sketched below). I also store the company_id and state of the test-set records in global lists so I can later display them alongside the predicted values, and I append the predicted values to a global list final_prediction.
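In other words, within each 12-record chunk the split looks like this (0-based positions, purely illustrative):

# split of one 12-record chunk (0-based positions; illustrative only)
positions = list(range(12))
test_positions = [9, 10]                                   # the 10th and 11th records
train_positions = [p for p in positions if p not in test_positions]
# train_positions == [0, 1, 2, 3, 4, 5, 6, 7, 8, 11]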
The issue I am facing is that company_list, state_list, and the test-set lists all end up the same size (about 200 entries), but final_prediction ends up only half that size (about 100 entries). If the test-set list has 200 entries, shouldn't final_prediction also have 200? My current code is:
from sklearn import linear_model
import numpy as np
import csv
final_prediction = []
company_list = []
state_list = []
def process_chunk(chuk):
    training_set_feature_list = []
    training_set_label_list = []
    test_set_feature_list = []
    test_set_label_list = []
    np.set_printoptions(suppress=True)
    prediction_list = []
    # to divide into training & test, I am putting the 10th and 11th lines in the test set
    count = 0
    for line in chuk:
        # Converting strings to numpy arrays
        if count == 9:
            test_set_feature_list.append(np.array(line[3:4],dtype = np.float))
            test_set_label_list.append(np.array(line[2],dtype = np.float))
            company_list.append(line[0])
            state_list.append(line[1])
        elif count == 10:
            test_set_feature_list.append(np.array(line[3:4],dtype = np.float))
            test_set_label_list.append(np.array(line[2],dtype = np.float))
            company_list.append(line[0])
            state_list.append(line[1])
        else:
            training_set_feature_list.append(np.array(line[3:4],dtype = np.float))
            training_set_label_list.append(np.array(line[2],dtype = np.float))
        count += 1
    # Create linear regression object
    regr = linear_model.LinearRegression()
    # Train the model using the training sets
    regr.fit(training_set_feature_list, training_set_label_list)
    prediction_list.append(regr.predict(test_set_feature_list))
    np.set_printoptions(formatter={'float_kind':'{:f}'.format})
    for items in prediction_list:
        final_prediction.append(items)
# Load and parse the data
file_read = open('data.csv', 'r')
reader = csv.reader(file_read)
chunk, chunksize = [], 12
for i, line in enumerate(reader):
    if (i % chunksize == 0 and i > 0):
        process_chunk(chunk)
        del chunk[:]
    chunk.append(line)
# process the remainder
#process_chunk(chunk)
print len(company_list)
print len(test_set_feature_list)
print len(final_prediction)
Why does this size difference occur, and what mistake in my code should I fix (perhaps something I am doing very naively that could be done in a better way)?
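For reference, here is a minimal, self-contained sketch of the same append pattern on dummy data (the numbers and the 'SV' label are made up, not from my real dataset), which shows the same 2-to-1 ratio between company_list and final_prediction:

from sklearn import linear_model

final_prediction = []
company_list = []

for chunk_no in range(3):                     # pretend there are 3 chunks
    X_train = [[1.0], [2.0], [3.0], [4.0]]    # dummy training features
    y_train = [10.0, 20.0, 30.0, 40.0]        # dummy training labels
    X_test = [[2.5], [3.5]]                   # 2 test rows per chunk

    regr = linear_model.LinearRegression()
    regr.fit(X_train, y_train)

    company_list.append('SV')                 # one append per test row
    company_list.append('SV')

    prediction_list = []
    prediction_list.append(regr.predict(X_test))   # predict() is called once on both test rows
    for items in prediction_list:
        final_prediction.append(items)

print len(company_list)      # 6
print len(final_prediction)  # 3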