Organize data into appropriate columns using Python

Question

I am using vertica_python to pull data from the database. The column that I pull comes as a string in the following format:

[{"id":0,"prediction_type":"CONV_PROBABILITY","calibration_factor":0.906556,"intercept":-2.410414,"advMatchTypeId":-0.239877,"atsId":-0.135568,"deviceTypeId":0.439130,"dmaCode":-0.251728,"keywordId":0.442240}]

I then split and parse this string and load it into Excel in the following format, with each item in its own cell:

prediction_type CONV_PROBABILIT calibration_factor  0.90655 intercept   -2.41041    advMatchTypeId  -0.23987    atsId   1.44701 deviceTypeId    0.19701 dmaCode -0.69982    keywordId   0.44224

Here's my problem. The string doesn't have a fixed format: sometimes some features are missing from it, which throws off my column alignment. Here's an example:

intercept   -2.41041    advMatchTypeId  -0.23987    deviceTypeId    0.37839 dmaCode -0.53552    keywordId   0.44224     
intercept   -2.41041    advMatchTypeId  -0.23987    atsId   0.80708 deviceTypeId    -0.19573    dmaCode -0.69982    keywordId   0.44224

How can I keep the columns aligned so that the example above comes out looking like this instead:

intercept   -2.41041    advMatchTypeId  -0.23987                     deviceTypeId   0.37839     dmaCode -0.53552    keywordId   0.44224
intercept   -2.41041    advMatchTypeId  -0.23987    atsId   0.80708  deviceTypeId   -0.19573    dmaCode -0.69982    keywordId   0.44224

This is the code I am using:

data_all = cur.fetchall()

for i in range(len(data_all)):
    col = 0
    data_one = ''.join(data_all[i])
    raw_coef = data_one.split(',')[1:len(data_all)]
    for j in range(len(raw_coef)):
        raw = ''.join(raw_coef[j])
        raw = re.sub('"|}|{|[|]|', '', raw)[:-1]
        raw = raw.split(":")
        for k in range(len(raw)):
            worksheet.write(i, col, raw[k], align_left)
            feature.append(raw[0]) # for unique values
            col+=1

My query:

cur.execute(
"""
select MODEL_COEF
from

dcf_funnel.ADV_BIDDER_PRICING_LOG
where MODEL_ID = 8960
and DATE(AMP_QUERY_TIMESTAMP) = '11-02-2016'
"""
)

Please add the code you currently use to split the data up and organise it to write to Excel.
– roganjosh, Commented Nov 3, 2016 at 17:36
That's better, thanks. You are not getting a string back, you're getting a list that contains a dictionary. You appear to be converting it to a string and then trying to use regex to split it all back up again. I need to check something with cursor properties and then I will try to put something together.
– roganjosh, Commented Nov 3, 2016 at 17:46
This string comes from a single column, if I understand your comment correctly.
– opamp, Commented Nov 3, 2016 at 17:47
It is not a string at all. It's a valid Python data structure (a list containing a dictionary). It only becomes a string when you do ''.join(). You seem to be shooting yourself in the foot with that part. I normally use SQLite, which returns tuples, but I can't think of any reason a query would ever return a string that you have to chop up with regex.
– roganjosh, Commented Nov 3, 2016 at 17:50
That is what I'm looking into now for you. But, without trying to sound too critical, "in any case" isn't really the correct response. You absolutely need to be able to recognise dict and list in Python to be able to do anything useful. You've made this task near impossible for yourself without knowing that.
– roganjosh, Commented Nov 3, 2016 at 17:55

chthonicdaemon · Accepted Answer · 2016-11-04 10:58:48Z

You can skip all your parsing and use pandas:

import pandas

This will read your query result into a DataFrame if it is already a list of dicts in Python.

data_all_list = [{"id":0,"prediction_type":"CONV_PROBABILITY","calibration_factor":0.906556,"intercept":-2.410414,"advMatchTypeId":-0.239877,"atsId":-0.135568,"deviceTypeId":0.439130,"dmaCode":-0.251728,"keywordId":0.442240}]
df = pandas.DataFrame(data_all_list)
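
A minimal sketch of why this helps with the missing features in the question: when the dicts do not all have the same keys, DataFrame lines the values up by key and leaves NaN where a key is absent. The two records below are made up from values in the question.

```python
import pandas

# Two illustrative records; the first one has no "atsId" key.
records = [
    {"intercept": -2.41041, "advMatchTypeId": -0.23987, "deviceTypeId": 0.37839},
    {"intercept": -2.41041, "advMatchTypeId": -0.23987, "atsId": 0.80708, "deviceTypeId": -0.19573},
]

df = pandas.DataFrame(records)

# Every feature gets its own column; the missing atsId in row 0 is NaN.
print(df)
missing_in_first_row = pandas.isna(df.loc[0, "atsId"])
```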

If you really have a string, you can just use read_json:

data_all_str = """[{"id":0,"prediction_type":"CONV_PROBABILITY","calibration_factor":0.906556,"intercept":-2.410414,"advMatchTypeId":-0.239877,"atsId":-0.135568,"deviceTypeId":0.439130,"dmaCode":-0.251728,"keywordId":0.442240}]"""
df = pandas.read_json(data_all_str)
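
Recent pandas releases warn when read_json is given the JSON text directly and ask for a file-like object instead. A sketch that should work on both older and newer versions, assuming the column value really is a JSON string, is to wrap it in io.StringIO, or to parse it with the standard json module first. The string here is a shortened version of the one above.

```python
import io
import json

import pandas

data_all_str = '[{"id":0,"prediction_type":"CONV_PROBABILITY","intercept":-2.410414,"atsId":-0.135568}]'

# Option 1: hand read_json a file-like object instead of the raw text.
df_from_read_json = pandas.read_json(io.StringIO(data_all_str))

# Option 2: parse the JSON yourself, then build the DataFrame from the list of dicts.
df_from_loads = pandas.DataFrame(json.loads(data_all_str))
```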

Further thought has led me to understand that your data_all is actually a list of lists of dicts, something like this:

data_all_lol = [data_all_list, data_all_list]

In this case, you need to concatenate the lists before passing to DataFrame:

df = pandas.DataFrame(sum(data_all_lol, []))
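
sum(..., []) is fine for a handful of rows, but it copies the growing list on every step. For larger results, itertools.chain.from_iterable flattens in a single pass. A small sketch with stand-in data:

```python
from itertools import chain

# Stand-in for a query result: each row is a list of dicts.
data_all_lol = [
    [{"intercept": -2.41041, "atsId": 0.80708}],
    [{"intercept": -2.41041}],
]

# Flatten one level: list of lists of dicts -> list of dicts.
flat_records = list(chain.from_iterable(data_all_lol))

# flat_records can be passed to pandas.DataFrame just like sum(data_all_lol, []).
```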

This will write it in the normal headers + values format:

df.to_csv('filename.csv') # you can also use to_excel
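
To get the layout asked for in the question (one column per feature, a blank cell where a feature is missing), the column order can be pinned with columns= and the row index dropped with index=False. This is a sketch with made-up rows; calling to_csv without a path returns the CSV text, which makes the result easy to inspect.

```python
import pandas

records = [
    {"intercept": -2.41041, "advMatchTypeId": -0.23987, "deviceTypeId": 0.37839},
    {"intercept": -2.41041, "advMatchTypeId": -0.23987, "atsId": 0.80708, "deviceTypeId": -0.19573},
]

# Fix the column order up front so every row is laid out the same way.
feature_order = ["intercept", "advMatchTypeId", "atsId", "deviceTypeId"]
df = pandas.DataFrame(records, columns=feature_order)

# With no path, to_csv returns the text; missing values are written as empty fields.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing a filename instead writes the same table to disk. to_excel does the same for a spreadsheet, provided an Excel writer package such as openpyxl is installed.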

If your final goal is just to obtain the means of all the features, pandas can do that straight away, with an arbitrary number of columns, handling missing values correctly:

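df.mean(numeric_only=True)  # skips the text column prediction_type; recent pandas versions raise a TypeError without it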

Gives

advMatchTypeId       -0.239877
atsId                -0.135568
calibration_factor    0.906556
deviceTypeId          0.439130
dmaCode              -0.251728
id                    0.000000
intercept            -2.410414
keywordId             0.442240
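
As a sketch of how missing values are handled: mean skips NaN by default, so a feature that is absent from some rows is averaged over the rows where it is present. numeric_only=True is passed because the data also contains a text column. The rows are illustrative, not taken from the real table.

```python
import pandas

records = [
    {"prediction_type": "CONV_PROBABILITY", "intercept": -2.0, "atsId": 1.0},
    {"prediction_type": "CONV_PROBABILITY", "intercept": -3.0},  # no atsId
]
df = pandas.DataFrame(records)

# NaN entries are ignored, so atsId is averaged over one row, intercept over two.
means = df.mean(numeric_only=True)
print(means)
```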

Note about ambiguity

In the OP it is hard to know the type of data_all because the snippet you show appears like a list of dicts in literal syntax, but you say "The column that I pull comes as a string".

Notice the difference between the way the inputs are represented in the following IPython session:

In [15]: data_all_str
Out[15]: '[{"id":0,"prediction_type":"CONV_PROBABILITY","calibration_factor":0.906556,"intercept":-2.410414,"advMatchTypeId":-0.239877,"atsId":-0.135568,"deviceTypeId":0.439130,"dmaCode":-0.251728,"keywordId":0.442240}]'

In [16]: data_all_list
Out[16]:
[{'advMatchTypeId': -0.239877,
  'atsId': -0.135568,
  'calibration_factor': 0.906556,
  'deviceTypeId': 0.43913,
  'dmaCode': -0.251728,
  'id': 0,
  'intercept': -2.410414,
  'keywordId': 0.44224,
  'prediction_type': 'CONV_PROBABILITY'}]
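
One way to sidestep the ambiguity is to check the type at runtime. The helper below is hypothetical (rows_to_records is not part of pandas or vertica_python) and assumes each fetched row holds a single column that is either JSON text or an already-parsed list of dicts.

```python
import json


def rows_to_records(rows):
    """Flatten fetched rows into one list of dicts (hypothetical helper)."""
    records = []
    for row in rows:
        value = row[0]  # assumption: the query selects exactly one column
        if isinstance(value, str):
            value = json.loads(value)  # JSON text -> list of dicts
        records.extend(value)
    return records


# Stand-in for cur.fetchall(): one row as JSON text, one already parsed.
rows = [
    ['[{"intercept": -2.41041, "atsId": 0.80708}]'],
    [[{"intercept": -2.41041}]],
]
records = rows_to_records(rows)
```

The resulting list can then go straight into pandas.DataFrame(records).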

edited Nov 4, 2016 at 10:58

answered Nov 3, 2016 at 18:10

chthonicdaemon


12 Comments

opamp Over a year ago

Thank you! But this will put the whole string into a single cell. I need to be able to parse it in order to get a description of the feature and average value. Therefore, I would like to have each cell contain a unique feature and its value

chthonicdaemon Over a year ago

I've updated the answer to show how to read the data from a json-compliant string, although it wasn't clear that you really have a string rather than a list of dicts.

roganjosh Over a year ago

@opamp Aha, now I see what you mean based on your last comment to me. Your very first line of code should have looked like:

"""[{"id":0,"prediction_type":"CONV_PROBABILITY","calibration_factor":0.906556,"intercept":-2.410414,"advMatchTypeId":-0.239877,"atsId":-0.135568,"deviceTypeId":0.439130,"dmaCode":-0.251728,"keywordId":0.442240}]"""

You really were getting a string of actual data structures, but you reported it as an actual list, suggesting that you were misusing the term string. This answer is probably correct for you.

opamp Over a year ago

@chthonicdaemon the result you got is amazing. I believe I am confusing everyone, myself included, by misusing words like string, list, and dict. I will look into it asap. But to get the result you got, did you use df = pandas.read_json(data_all) or df = pandas.DataFrame(data_all)? I get an error when using read_json saying a string is expected.

chthonicdaemon Over a year ago

Use the one that works for you. If the result of the query is a list of dicts the DataFrame one will work.
