I am new to Python and I am very confused by all these data types, such as Series, Array, List, etc. This is probably a very open-ended question; I am hoping to get a feel for the general practice when coding in Python for data analysis.

Many readings suggest that numpy and pandas are the two modules I need for data analysis. However, I find it hard and odd that they operate on and generate data in two different types, i.e. Series and Array. Is it normal/natural that one needs to convert one data type to the other before any kind of data manipulation? I would like to know what you would do. Many thanks.

For example:

 import pandas as pd
 import numpy as np

 # create some data
 df = pd.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'c'])
 x = np.random.randn(10, 1)

 # data manipulation
 A = df['a']

 # Question 1:
 # If I want to perform an element-by-element addition between x and A,
 # how should I do it? A simple x + A doesn't work, but it seems strange to
 # me that I would have to convert the data type every time.

 # Question 2:
 # I'd like to combine the two columns together;
 # concatenate and hstack both don't work.
  • What do you want to get: numpy.arrays, or pd.Series and pd.DataFrames? Commented Feb 2, 2016 at 8:06
  • I presume I would want a DataFrame at the end, as I start with a DataFrame (since I import data using pandas). Basically, I find the two modules incompatible with each other, which is annoying, and I am wondering if I am heading in the right direction (an extra step/function seems to be required in almost every operation). Commented Feb 2, 2016 at 8:50

2 Answers

For addition, your arrays/Series should have the same dimensions:

In [98]: A.shape
Out[98]: (10,)

In [99]: x.shape
Out[99]: (10, 1)

You could call reshape(-1) to flatten your column vector into a 1-D array:

In [100]: x.reshape(-1).shape
Out[100]: (10,)

Then you could add that to the pd.Series A:

In [61]: A + x.reshape(-1)
Out[61]:
0   -1.186957
1   -0.165563
2    0.882490
3    4.544357
4    2.698414
5    0.396110
6   -0.199209
7    3.282942
8    2.448213
9   -0.543727
Name: a, dtype: float64

For your second question, you need to reshape the Series A into a column vector so that its shape matches x. You could do it with reshape:

In [97]: np.hstack([A.values.reshape(A.size,1), x])
Out[97]:
array([[ 0.3158111 , -1.50276813],
       [-1.09532212,  0.92975954],
       [-0.77048623,  1.65297592],
       [ 2.14690242,  2.39745455],
       [ 1.63367806,  1.06473634],
       [ 0.09134512,  0.3047644 ],
       [ 0.02019805, -0.21940726],
       [ 0.87008192,  2.41286007],
       [ 1.25315724,  1.19505578],
       [-0.60156045,  0.05783343]])

If you want to get a pd.DataFrame, you could use pd.concat:

In [108]: pd.concat([A, pd.Series(x.reshape(-1))], axis=1)
Out[108]:
          a         0
0  0.315811 -1.502768
1 -1.095322  0.929760
2 -0.770486  1.652976
3  2.146902  2.397455
4  1.633678  1.064736
5  0.091345  0.304764
6  0.020198 -0.219407
7  0.870082  2.412860
8  1.253157  1.195056
9 -0.601560  0.057833
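The second column in the output above is labelled 0 because the Series built from x has no name. A minimal sketch of two ways to get a named column, assuming the new column should be called "x" (that name, the seed, and the variable names `combined` and `with_x` are illustrative choices, not part of the question):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable
df = pd.DataFrame(rng.standard_normal((10, 3)), columns=["a", "b", "c"])
x = rng.standard_normal((10, 1))

# Name the new Series and reuse the frame's index so the rows line up
x_series = pd.Series(x.ravel(), index=df.index, name="x")
combined = pd.concat([df["a"], x_series], axis=1)

# Alternatively, add the flattened array as a new column on a copy of the frame
with_x = df.assign(x=x.ravel())

print(combined.columns.tolist())
print(with_x.columns.tolist())
```

Passing `index=df.index` matters when the DataFrame does not use the default 0..n-1 index, because pd.concat aligns rows by index label rather than by position.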

EDIT

From the numpy.reshape docs, regarding reshape(-1):

newshape : int or tuple of ints
The new shape should be compatible with the original shape. If an integer, then the result will be a 1-D array of that length. One shape dimension can be -1. In this case, the value is inferred from the length of the array and remaining dimensions.
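As a small illustration of the quoted rule, here is a sketch of how -1 is inferred for a few target shapes. The array contents and variable names are made up for the example:

```python
import numpy as np

x = np.arange(10).reshape(10, 1)   # column vector, shape (10, 1)

flat = x.reshape(-1)               # length inferred from the 10 elements
pairs = x.reshape(-1, 2)           # row count inferred as 10 / 2
column = flat.reshape(-1, 1)       # back to a column vector

print(flat.shape, pairs.shape, column.shape)
```

Only one dimension may be -1; the remaining dimensions must divide the total number of elements, otherwise NumPy raises a ValueError.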


3 Comments

What does .reshape(-1) do/mean? Thanks.
Edited the answer to cover that.
@Lafayette Note that reshape(-1) will work for an original array of any size, while reshape(10) is only accepted for a vector of size 10.

Is it normal/natural that one needs to convert one data type to the other before any kind of data manipulation?

Sometimes you need to, sometimes you don't. When in doubt, do it.

That said, remember the Zen of Python:

  • Explicit is better than implicit.
  • In the face of ambiguity, refuse the temptation to guess.

Even if some APIs will do their best to convert types for you (numpy and pandas are quite good at that), explicit type casting can make your code more readable and easier to debug.
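A minimal sketch of what explicit conversion can look like in both directions. It assumes a reasonably recent pandas, where `Series.to_numpy()` is available (older code typically used `.values`); the sample values and variable names are illustrative:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], name="a")

# Series -> ndarray: the index and name are dropped
arr = s.to_numpy()

# ndarray -> Series: the index and name have to be supplied again
back = pd.Series(arr, index=s.index, name=s.name)

# Many NumPy functions also accept a Series directly and return a Series
roots = np.sqrt(s)

print(type(arr).__name__, type(back).__name__, type(roots).__name__)
```

Converting explicitly makes it visible in the code where index labels are discarded and where they are reattached.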

Question 1: If I want to perform an element-by-element addition between x and A, how should I do it? A simple x + A doesn't work, but it seems strange to me that I would have to convert the data type every time.

You do not have to convert data types in this case but you need compatible shapes.

>>> print(A.shape)
(10,)
>>> print(x.shape)
(10, 1)
>>> print(A + x.reshape(10))
0   -0.207131
1   -2.117012
2    0.925545
3   -2.187705
4    1.226458
5    2.144904
6   -0.956781
7    1.956246
8    0.060132
9    1.332417
Name: a, dtype: float64
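To see why the shapes matter, here is a sketch using plain NumPy arrays, where a (10,) operand and a (10, 1) operand broadcast to a 10x10 grid instead of a length-10 result. The seed and the names `grid` and `total` are assumptions made for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable
df = pd.DataFrame(rng.standard_normal((10, 3)), columns=["a", "b", "c"])
x = rng.standard_normal((10, 1))
A = df["a"]

# NumPy broadcasting: (10,) + (10, 1) expands to every pairwise sum
grid = A.to_numpy() + x

# Flattening x first gives the intended element-wise sum, still a Series
total = A + x.ravel()

print(grid.shape, total.shape)
```

Here `x.ravel()` plays the same role as `x.reshape(10)` above, without hard-coding the length.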

Question 2: I'd like to combine the two columns together; concatenate and hstack both don't work.

It is not clear what the desired output is, but I think it is again a matter of shapes, not types. Here is one option, the pandas way:

>>> print(pd.concat([A, pd.Series(x.reshape(10))], axis=1))
          a         0
0 -0.158667 -0.048463
1 -0.847246 -1.269765
2 -0.128232  1.053778
3 -1.316113 -0.871593
4  1.057044  0.169414
5  3.188343 -1.043439
6 -0.032524 -0.924257
7  1.412443  0.543803
8 -0.730386  0.790519
9  0.289796  1.042621
