
I am reading data from a CSV to perform feature elimination. Here is what the data looks like:

shift_id    user_id status  organization_id location_id department_id   open_positions  city    zip role_id specialty_id    latitude    longitude   years_of_experience
0   2   9   S   1   1   19  1   brooklyn    48001.0 2.0 9.0 42.643  -82.583 NaN
1   6   60  S   12  19  20  1   test    68410.0 3.0 7.0 40.608  -95.856 NaN
2   9   61  S   12  19  20  1   new york    48001.0 1.0 7.0 42.643  -82.583 NaN
3   10  60  S   12  19  20  1   test    68410.0 3.0 7.0 40.608  -95.856 NaN
4   21  3   S   1   1   19  1   pune    48001.0 1.0 2.0 46.753  -89.584 0.0

Here is my code -

import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

dataset = pd.read_csv("data.csv", header=0)
data = pd.read_csv("data.csv", header=1)
target = dataset.location_id
#dataset.head()
svm = LinearSVC()
rfe = RFE(svm, 3)
rfe = rfe.fit(data, target)
print(rfe.support_)
print(rfe.ranking_)

But I am getting this error:

ValueError: could not convert string to float: '1,141'

There is no string like this in my data.

There are some empty cells, so I tried to use:

result.fillna(0, inplace=True)

This gave the following error:

ValueError: Expected 2D array, got scalar array instead:
array=None.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Any suggestions on how to preprocess this data correctly?

Here is a link to sample data: https://gist.github.com/karimkhanp/6db4f9f9741a16e46fc294b8e2703dc7

  • float('1,141'.replace(",", ".")) should do it Commented Feb 14, 2019 at 11:47
  • @user5173426: Actually, when I check my data, I don't find 1,141 anywhere in it Commented Feb 14, 2019 at 12:23
  • Can you show us an output of something like cat prod_data_for_ML.csv | grep 141 executed in the folder where your file is, assuming you're on Linux? Commented Feb 14, 2019 at 12:35
  • @SergeyBushmanov: Surprisingly, it is ``141,5,S,14,23,33,1,newton,"48001",3,15,42.643,-82.583,2 "1,141","1,139",A,14,24,77,1,OWINGS MILLS,"21117",8,,39.41,-76.79, "4,141","1,694",A,16,34,124,1,Redmonds,"98051",1,2,47.33,-121.879, "5,141","4,584",A,122,179,307,1,Gotham,"02458",1,7,42.35,-71.186, Commented Feb 14, 2019 at 12:40
  • Tell us: do you want 1,141 to be 1141 or 1.141? Commented Feb 14, 2019 at 12:42

3 Answers


The solution to your ValueError: could not convert string to float: '1,141' is to use the thousands parameter in your pd.read_csv():

dataset = pd.read_csv("data.csv", header=0, thousands=",")
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 14 columns):
shift_id                3 non-null int64
user_id                 3 non-null int64
status                  3 non-null object
organization_id         3 non-null int64
location_id             3 non-null int64
department_id           3 non-null int64
open_positions          3 non-null int64
city                    3 non-null object
zip                     3 non-null int64
role_id                 3 non-null int64
specialty_id            2 non-null float64
latitude                3 non-null float64
longitude               3 non-null float64
years_of_experience     3 non-null object
dtypes: float64(3), int64(8), object(3)
memory usage: 416.0+ bytes
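To see the effect in isolation, here is a minimal sketch using an inline CSV string (a stand-in for your file) with quoted thousands-separated values like "1,141":

```python
import io
import pandas as pd

# Hypothetical two-column CSV where the comma is a thousands separator
csv_text = 'shift_id,user_id\n"1,141","1,139"\n"4,141","1,694"\n'

df = pd.read_csv(io.StringIO(csv_text), thousands=",")
print(df.dtypes)                 # both columns parsed as int64
print(df["shift_id"].tolist())   # [1141, 4141]
```

With thousands=",", pandas strips the grouping commas during parsing, so the columns come out numeric instead of object.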

1 Comment

You may also need to declare any values representing nulls using the argument `na_values=...` and give it a list, or those values may cause pandas to choke. Perhaps also make the conversion to float explicit with `dtype={'some_col': float}`.
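A hedged sketch of that suggestion (the column names match the question's data; the na_values entries are assumptions about how nulls might appear in the file):

```python
import io
import pandas as pd

csv_text = 'role_id,specialty_id\n2,NULL\n3,\n1,7\n'

df = pd.read_csv(
    io.StringIO(csv_text),
    na_values=["NULL", "N/A", ""],   # strings to treat as NaN (assumed forms)
    dtype={"role_id": float},        # force an explicit float conversion
)
print(df["specialty_id"].isna().sum())   # 2 missing values
print(df["role_id"].dtype)               # float64
```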

Your question contains result.fillna(0, inplace=True).

But since result appears nowhere before, it is not clear what its value is. It is probably None: fillna(..., inplace=True) modifies the object in place and returns None, which matches the array=None in your error.
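A quick sketch of that pitfall, assuming result was a DataFrame: with inplace=True, fillna returns None, so passing that return value onward produces exactly the array=None symptom:

```python
import pandas as pd
import numpy as np

result = pd.DataFrame({"a": [1.0, np.nan]})

ret = result.fillna(0, inplace=True)   # modifies result in place...
print(ret)                             # ...and returns None

fixed = result.fillna(0)               # without inplace, a new DataFrame is returned
print(fixed["a"].tolist())             # [1.0, 0.0]
```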

Another weird detail in your code. Look at:

dataset = pd.read_csv("prod_data_for_ML.csv",header = 0)
data = pd.read_csv("prod_data_for_ML.csv",header = 1)

Note that you read twice, from the same file, but:

  • the first time you read with header = 0, so, as the documentation states, column names are inferred from the first line,
  • the second time you read with header = 1.

Is this your intention? Or maybe in both calls header should be the same?

And one more remark: Reading 2 times from the same file is (in my opinion) unnecessary. Maybe your code should be like this:

data = pd.read_csv("prod_data_for_ML.csv",header = 0)
target = data.location_id
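Building on that, a minimal sketch of the single-read approach feeding RFE (the inline CSV and the feature/target split are assumptions about your intent; only numeric columns are used here just to keep the example runnable):

```python
import io
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

csv_text = (
    "location_id,latitude,longitude,role_id\n"
    "1,42.643,-82.583,2\n"
    "12,40.608,-95.856,3\n"
    "12,42.643,-82.583,1\n"
    "1,46.753,-89.584,1\n"
)
data = pd.read_csv(io.StringIO(csv_text), header=0)

target = data.location_id
features = data.drop(columns="location_id")   # features must not include the target

rfe = RFE(LinearSVC(), n_features_to_select=2)
rfe = rfe.fit(features, target)
print(rfe.support_)
```

Note that the target column is dropped from the feature matrix before fitting; keeping it would let the model trivially "predict" the target from itself.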

Edit

As I understood from your comments, you want:

  • the first table - dataset - with the first column (shift_id),
  • the second table - data - without this column.

Then your code should contain:

dataset = pd.read_csv("data.csv",header = 0)  # Read the whole source file, reading column names from the starting row
data = dataset.drop(columns='shift_id')       # Copy dropping "shift_id" column
...

Note that header=1 does not "skip" any column, but states only from which source row read column names. In this case:

  • Row No 0 (the starting row, containing actual column names) is skipped.
  • Column names are read from the next row (due to header=1), containing actually the first row of data.
  • Only the remaining rows are read into rows of the target table.

If you want to "skip" some source columns, call read_csv with usecols parameter, but it specifies which columns to read (not to skip).

So, assuming that your source file has 14 columns (numbered from 0 to 13), and you want to omit only the first (number 0), you could write usecols=[*range(1, 14)] (note that the upper limit (14) is not included in the range).
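A quick sketch of that usecols behaviour (an inline three-column CSV stands in for your file):

```python
import io
import pandas as pd

csv_text = "shift_id,user_id,status\n2,9,S\n6,60,S\n"

# Keep every column except the first (number 0)
df = pd.read_csv(io.StringIO(csv_text), usecols=[*range(1, 3)])
print(df.columns.tolist())   # ['user_id', 'status']
```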

And one more remark concerning your data sample: the first column is the index, without any name. shift_id is the next column, so to avoid confusion, you should put some indentation in the first row.

Note that the city column is in your header at position 8, but in the data rows (brooklyn, test) it is at position 9. So the "title" row (column names) should be indented.

Edit 2

Look at your comment to the question, written 2019-02-14 12:40:19Z. It contains a row like this:

"1,141","1,139",A,14,24,77,1,OWINGS MILLS,"21117"

It shows that the first 2 columns (shift_id and user_id) contain a string representation of a float, but with a comma instead of a dot.

You can cope with this problem using your own converter function, e.g.:

def cnvToFloat(x):
    return float(x.replace(',', '.'))

and call read_csv passing this function in the converters parameter for such "required" (ill-formatted) columns, e.g.:

dataset = pd.read_csv("data.csv", header = 0, 
    converters={'shift_id': cnvToFloat, 'user_id': cnvToFloat})
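In action, sketched with an inline CSV (assuming the comma in those columns really is meant as a decimal separator):

```python
import io
import pandas as pd

def cnvToFloat(x):
    # Treat the comma as a decimal separator: "1,141" -> 1.141
    return float(x.replace(",", "."))

csv_text = 'shift_id,user_id,status\n"1,141","1,139",A\n"4,141","1,694",A\n'

dataset = pd.read_csv(
    io.StringIO(csv_text), header=0,
    converters={"shift_id": cnvToFloat, "user_id": cnvToFloat},
)
print(dataset["shift_id"].tolist())   # [1.141, 4.141]
```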

5 Comments

The reason I read twice is: in target I want the first column, and in the data variable I don't want the first column, so I started with header = 1
data = pd.read_csv("prod_data_for_ML.csv", header=1) followed by target = data.location_id gives me the error AttributeError: 'NoneType' object has no attribute 'location_id'
header = 1 actually means "read column names from row 1" (rows are numbered from 0). So probably: you skip row 0 (the actual column names) and read the first data row as column names. Then no column name read from there is location_id, so you get the error (you try to refer to a non-existing column).
Can you please update the code to avoid these two reads and still provide location_id?
The "non-existing" string 1,141 may be the result of gluing together two adjacent columns, the first containing "1", then the comma separating columns, and then "141". As the source file is .csv, you may perform this check: open the file in any text editor and look for 1,141 in its content. If you find something like this, look very thoroughly at that row. Maybe some column contains a comma, causing a wrong division of this source row into destination columns.

1,141 is an invalid float.

To convert it to a float, first make it a valid representation by replacing , with .; then casting it to float will work.

bad_float = '1,141'

print(float(bad_float.replace(",",".")))

OUTPUT:

1.141

EDIT:

As noted by @ShadowRanger, unless the comma is actually supposed to be a comma for separating digit groupings (to make it more human-readable):

comm_sep = '1,141'

res = comm_sep.split(",")

print(float(res[0]), float(res[1]))

OUTPUT:

1.0 141.0
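If the comma really is a digit-grouping separator (the interpretation @ShadowRanger had in mind), you strip it rather than split on it; a minimal sketch:

```python
grouped = "1,141"

# Remove digit-grouping commas, then cast
print(int(grouped.replace(",", "")))   # 1141

larger = "12,345,678"
print(int(larger.replace(",", "")))    # 12345678
```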

EDIT 2:

The issue was resolved by the OP, who explicitly changed the column type to number in the CSV file editor.

5 Comments

Unless the comma is actually supposed to be a comma for separating digit groupings (to make it more human readable), in which case you just get rid of the commas, you don't replace them with periods. The sample data uses periods for the decimal separator after all.
@ShadowRanger Agreed, if that is the case, Added an example for that as well.
You slightly misinterpreted (my fault for not being more explicit), but your misinterpretation of my point is also a possibility. :-) I was suggesting that the number was actually 1141, with comma inserted every three digits. So a larger number might appear as 12,345,678, and be intended to mean 12345678.
Thanks everyone. But the issue is, when I check my data it does not contain 1,141 anywhere, so converting to float does not make sense. Or am I going wrong somewhere?
@JhonPatric you just mentioned you want 1,141 to be 1141: print(int('1,141'.replace(",",""))) ?
