I am using Python 2.7 with Dask and trying to read a database table from a remote machine into a Dask dataframe.

The table has a multi-column index, and I try to read it using the following script:

ddf = dd.read_sql_table("table name", "mysql://user:pass@ip:port/Dbname", "specific column name").head()

and I get the following error:

start = asanyarray(start) * 1.0
TypeError: ufunc 'multiply' did not contain a loop with signature matching types dtype('S32') dtype('S32') dtype('S32')

I got the sqlalchemy uri as explained here

I'm not sure what the problem is. When I use another column as the index and only call head() on the ddf, I don't get an error, but when I try to compute the whole ddf I get the same error. I assume the issue is that the column does not contain unique values: I don't have a single-column index, only a multi-column one. How can I read the entire table in this case?

Thanks.

Full traceback:

> Traceback (most recent call last):
>   File "path", line 28, in <module>
>     ddf = dd.read_sql_table("tablename", "mysql://user:pass@ip:port/dbname","indexcolumn")
>   File "file", line 123, in read_sql_table
>     divisions = np.linspace(mini, maxi, npartitions + 1).tolist()
>   File "/home/user/.local/lib/python2.7/site-packages/numpy/core/function_base.py", line 108, in linspace
>     start = asanyarray(start) * 1.0
> TypeError: ufunc 'multiply' did not contain a loop with signature matching types dtype('S32') dtype('S32') dtype('S32')
  • Can you verify that the equivalent pandas operation works? Commented Nov 30, 2017 at 12:47
  • Please show a more detailed traceback and perhaps run debug to find the value of start when the error happens. Commented Nov 30, 2017 at 13:47
  • @MRocklin works great with pandas Commented Nov 30, 2017 at 14:25
  • @mdurant the value of start is {str}'-1000001542' Commented Nov 30, 2017 at 14:27
  • Looks like you see a string that ought to be a number Commented Nov 30, 2017 at 16:17

1 Answer

When you provide no further information, or only specify the number of partitions, the partitioning logic in read_sql_table only works for numbers, because it needs a way to make ordered divisions between the minimum and maximum values of the index column.
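To make the failure concrete, here is a minimal sketch of evenly spaced divisions between a minimum and a maximum. It only mimics the np.linspace call seen in the traceback; the function name linear_divisions is hypothetical and this is not dask's actual implementation.

```python
def linear_divisions(mini, maxi, npartitions):
    """Return npartitions + 1 evenly spaced boundaries from mini to maxi.

    Hypothetical helper: a plain-Python stand-in for
    np.linspace(mini, maxi, npartitions + 1).tolist().
    """
    # Subtraction is only defined for numbers; with two strings this
    # line raises TypeError, much like the ufunc error in the question.
    step = (maxi - mini) / float(npartitions)
    return [mini + i * step for i in range(npartitions)] + [float(maxi)]


# Numeric bounds work as expected.
numeric = linear_divisions(0, 100, 4)

# String bounds (what the min/max query returned here) do not.
try:
    linear_divisions("-1000001542", "1000", 4)
    string_failed = False
except TypeError:
    string_failed = True
```

With numeric bounds the boundaries come out evenly spaced; with string bounds the arithmetic itself is undefined, which is why the default partitioning cannot proceed.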

Apparently, the query that fetches the max/min is returning a string in this case. read_sql_table can still work, but you will need to define the divisions to split on yourself and supply them with the divisions keyword, e.g.,

ddf = dd.read_sql_table("table name", "mysql://user:pass@ip:port/Dbname", 
    'index_col', divisions=['aardvark', 'llama', 'tapir', 'zebra']).head()

Alternatively, the string in question certainly looks like a number, so you might need to update the schema of the table to make sure it is interpreted as a number.
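As a rough illustration of why the column type matters, the sketch below uses an in-memory SQLite database as a stand-in (the question uses MySQL, whose typing and collation rules differ). The table name demo and column name idx are made up for the example.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for the real MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (idx TEXT)")  # numbers stored as text
conn.executemany(
    "INSERT INTO demo VALUES (?)",
    [("-1000001542",), ("25",), ("100",)],
)

# On a text column, MIN/MAX compare as strings and return strings.
text_min, text_max = conn.execute(
    "SELECT MIN(idx), MAX(idx) FROM demo"
).fetchone()

# Casting to an integer gives numeric ordering and numeric results.
int_min, int_max = conn.execute(
    "SELECT MIN(CAST(idx AS INTEGER)), MAX(CAST(idx AS INTEGER)) FROM demo"
).fetchone()

conn.close()
```

Here the text maximum is "25" while the numeric maximum is 100, so a column stored as text is not only the wrong type for computing divisions but can also order differently from what you expect.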


4 Comments

Could you provide a full example of the first solution?
Could you provide a working example for the first solution?
Thanks! So, just so I understand it fully: the divisions are in terms of the index column, and the index column only? And how will it use them, given that the strings need to be sorted to fit within those divisions? Does it use lexicographic order?
Yes, the divisions are the boundaries of each partition for the index column; say, the second partition would be WHERE index_col > "llama" AND index_col <= "tapir". If you provide divisions, it is up to you to order them and to know how your DB will compare values against them.
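The mapping from divisions to partitions can be sketched as follows. The helper partition_predicates is hypothetical and follows the comparison operators quoted in the comment above; making the first partition inclusive of its lower bound is an added assumption, and the SQL that dask actually generates may treat the bounds differently. The strings are built for display only, not for execution against a database.

```python
def partition_predicates(index_col, divisions):
    """Build one WHERE-clause predicate per partition (illustration only).

    Hypothetical helper: each pair of neighbouring divisions bounds one
    partition, so len(divisions) - 1 predicates are returned.
    """
    predicates = []
    pairs = zip(divisions[:-1], divisions[1:])
    for i, (lower, upper) in enumerate(pairs):
        # Assumption: include the very first boundary so no row is dropped.
        lower_op = ">=" if i == 0 else ">"
        predicates.append(
            "{col} {op} '{lo}' AND {col} <= '{hi}'".format(
                col=index_col, op=lower_op, lo=lower, hi=upper
            )
        )
    return predicates


preds = partition_predicates(
    "index_col", ["aardvark", "llama", "tapir", "zebra"]
)
for p in preds:
    print(p)
```

Four division values give three partitions. Note that Python's sorted() orders strings by code point, while the database orders them by its own collation, so divisions that look sorted in Python may not be sorted from the database's point of view.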
