Instructing R to find variable name in rows when reading csv file

Question

Is there a way to have R read the column/variable name in each cell when reading a csv file?

My csv file is malformed. Not every row has every variable and not every row is of the same length. However, every row has a variable name within it, e.g. "id": "37189", "city": "Phoenix", "type": "business". When I tell R to read the csv can I instruct it to find the variable name within the data and sort accordingly?

Data sample for your convenience:

business_id: vcNAWiLM4dR7D2nwwJ7nCA, full_address: 4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018, close: 17:00, open: 08:00, open: true, categories: [Doctors, Health & Medical], city: Phoenix, review_count: 9, name: Eric Goldberg, MD, neighborhoods: [], longitude: -111.98375799999999, state: AZ, stars: 3.5, latitude: 33.499313000000001, attributes: By Appointment Only: true, type: business,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

business_id: UsFtqoBl7naz8AVUBZMjQQ,full_address: 202 McClure St\nDravosburg, PA 15034, open: true, categories: [Nightlife], city: Dravosburg, review_count: 4, name: Clancy's Pub, neighborhoods: [], longitude: -79.886930000000007, state: PA, stars: 3.5, latitude: 40.350518999999998, attributes: Happy Hour: true, Accepts Credit Cards: true, Good For Groups: true, Outdoor Seating: false, Price Range: 1, type: business,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

business_id: cE27W9VPgO88Qxe4ol6y_g,{ full_address: 1530 Hamilton Rd\nBethel Park, PA 15234}, open: false, categories: [Active Life, Mini Golf, Golf], city: Bethel Park, review_count: 5, name: Cool Springs Golf Center, neighborhoods: [], longitude: -80.015910000000005, state: PA, stars: 2.5, latitude: 40.356896200000001, attributes: Good for Kids: true, type: business,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

In bold are a few of the variables which do not appear in other entries.

Are you sure it's a csv format? It looks like json or something similar.
– cocquemas, Commented Oct 24, 2015 at 3:21
i agree with hfty. some evil person took JSON, removed brackets and shoved it into field 1 of a CSV file. is there no way to go back to the source of this file and ask the creator just to give you the JSON? dealing with (what is 99% most likely) the nested structure in attributes without brackets is going to involve some interesting parsing/munging.
– hrbrmstr, Commented Oct 24, 2015 at 10:19
i knew i recognized this. how'd you get perfectly good Yelp API JSON into this format?
– hrbrmstr, Commented Oct 24, 2015 at 10:35
@hrbrmstr Oh that'd make sense--I don't know anything about JSON but this file is so messed up it would make sense it's not supposed to be csv. Thanks.
– Unrelated, Commented Oct 24, 2015 at 16:51

hrbrmstr · Accepted Answer · 2015-10-24 11:21:10Z

This will get you started but you still have quite a bit of work to do. This works for one line (and it may work for the other two in the example) but it can be extrapolated to work with all of the lines (lapply FTW). Basically you need to rebuild the JSON structure from that single field (there may be alternative ways, especially if you do not need all the fields). It's easier than it might otherwise be since the Yelp schema is known.

You have to attack it in a pretty deterministic way, converting some fields before others, accounting for spaces in field names, dealing with arrays & nested structures, etc. As I said, you have quite a bit of work ahead of you. If your regex-fu is weak, this will provide ample practice to become a regex ninja.
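
As one small illustration of the array handling mentioned above, here is a minimal base R sketch that turns a bracketed, unquoted list into JSON array text. The helper name `to_json_array` is hypothetical (it is not part of any package), and the sketch assumes that no item contains a comma or a double quote.

```r
# Hypothetical helper: turn "[a, b, c]" into the JSON array text '["a", "b", "c"]'.
# Assumes items contain neither commas nor double quotes.
to_json_array <- function(raw) {
  inner <- gsub("^\\[|\\]$", "", raw)                       # strip the surrounding brackets
  items <- trimws(strsplit(inner, ",", fixed = TRUE)[[1]])  # split on commas, trim spaces
  items <- items[nzchar(items)]                             # "[]" yields no items
  paste0("[", paste(sprintf('"%s"', items), collapse = ", "), "]")
}

to_json_array("[Doctors, Health & Medical]")
to_json_array("[]")
```

The same idea applies to any other bracketed field, such as neighborhoods, before the whole string is handed to a JSON parser.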

library(stringi)
library(stringr)
library(jsonlite)

txt <- 'business_id: vcNAWiLM4dR7D2nwwJ7nCA, full_address: 4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018, close: 17:00, open: 08:00, open: true, categories: [Doctors, Health & Medical], city: Phoenix, review_count: 9, name: Eric Goldberg, MD, neighborhoods: [], longitude: -111.98375799999999, state: AZ, stars: 3.5, latitude: 33.499313000000001, attributes: By Appointment Only: true, type: business'
txt <- gsub("\n", "|", txt)

# Yelp business IDs can contain "_" and "-" (see the third sample row)
txt <- sub("business_id: ([[:alnum:]_-]+)", '"business_id": "\\1"', txt)

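# open the nested "attributes" object, then quote each attribute key seen in the sample rows
txt <- sub('attributes: ', '"attributes": {', txt)
txt <- sub('By Appointment Only: ', '"By Appointment Only": ', txt)
txt <- sub('Happy Hour: ', '"Happy Hour": ', txt)
txt <- sub('Accepts Credit Cards: ', '"Accepts Credit Cards": ', txt)
txt <- sub('Good For Groups: ', '"Good For Groups": ', txt)
txt <- sub('Good for Kids: ', '"Good for Kids": ', txt)
txt <- sub('Outdoor Seating: ', '"Outdoor Seating": ', txt)
txt <- sub('Price Range: ', '"Price Range": ', txt)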

txt <- sub("full_address: ([[:alnum:][:space:],\\|\\-\\.]+), close:", '"full_address": "\\1", close:', txt)
txt <- sub("full_address: ([[:alnum:][:space:],\\|\\-\\.]+), open:",  '"full_address": "\\1", open:', txt)

txt <- sub("name: (.*), neighborhoods:", '"name": "\\1", "neighborhoods":', txt)

txt <- gsub("open: ([[:alnum:]\\:]+)", '"open": "\\1"', txt)
txt <- sub("close: ([[:alnum:]\\:]+)", '"close": "\\1"', txt)

txt <- sub("longitude: ([[:digit:]\\.-]+)", '"longitude": "\\1"', txt)
txt <- sub("latitude: ([[:digit:]\\.-]+)", '"latitude": "\\1"', txt)

txt <- sub("review_count: ([[:digit:]\\.]+)", '"review_count": "\\1"', txt)
txt <- sub("stars: ([[:digit:]\\.]+)", '"stars": "\\1"', txt)
txt <- sub("state: ([[:alpha:]]+)", '"state": "\\1"', txt)
txt <- sub("city: ([[:alpha:] \\.-]+)", '"city": "\\1"', txt)

txt <- sub("type: ([[:alpha:]]+)", '"type": "\\1"', txt)

cats <- paste0(sprintf('"%s"', str_trim(str_split(str_match_all(txt, "categories: \\[([[:alpha:] &-,]+)\\],")[[1]][,2], ",")[[1]])), collapse=", ")
txt <- sub("categories: \\[([[:alpha:] &-,]+)\\],", '"categories": [' %s+% cats %s+% '],', txt)

txt <- "{" %s+% txt %s+% "}}"

fromJSON(txt)
## $business_id
## [1] "vcNAWiLM4dR7D2nwwJ7nCA"
## 
## $full_address
## [1] "4840 E Indian School Rd|Ste 101|Phoenix, AZ 85018"
## 
## $close
## [1] "17:00"
## 
## $open
## [1] "08:00"
## 
## $open
## [1] "true"
## 
## $categories
## [1] "Doctors"          "Health & Medical"
## 
## $city
## [1] "Phoenix"
## 
## $review_count
## [1] "9"
## 
## $name
## [1] "Eric Goldberg, MD"
## 
## $neighborhoods
## list()
## 
## $longitude
## [1] "-111.98375799999999"
## 
## $state
## [1] "AZ"
## 
## $stars
## [1] "3.5"
## 
## $latitude
## [1] "33.499313000000001"
## 
## $attributes
## $attributes$`By Appointment Only`
## [1] TRUE
## 
## $attributes$type
## [1] "business"
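
If only a handful of scalar fields are needed, the "alternative ways" route can be sketched without rebuilding the JSON at all. The following is a minimal base R sketch, not a complete parser: `extract_field` and `parse_line` are hypothetical helper names, the two rows are abbreviated from the question's sample, and the value patterns are assumptions about what each field can contain. A field that is missing from a row comes back as `NA`.

```r
# Two rows abbreviated from the question's sample; the trailing commas are CSV padding.
lines <- c(
  "business_id: vcNAWiLM4dR7D2nwwJ7nCA, city: Phoenix, review_count: 9, state: AZ, stars: 3.5, type: business,,,,",
  "business_id: cE27W9VPgO88Qxe4ol6y_g, open: false, city: Bethel Park, review_count: 5, state: PA, stars: 2.5, type: business,,,"
)

# Hypothetical helper: value following the first "<key>: " at the start of the
# line or right after a comma; NA if the key is absent.
extract_field <- function(line, key, value_pattern) {
  pattern <- paste0("(^|, ?)", key, ": (", value_pattern, ")")
  m <- regmatches(line, regexec(pattern, line))[[1]]
  if (length(m) == 0) NA_character_ else m[3]
}

# Hypothetical helper: one row of the result per input line.
parse_line <- function(line) {
  line <- sub(",+$", "", line)  # drop the trailing CSV padding
  data.frame(
    business_id  = extract_field(line, "business_id", "[[:alnum:]_-]+"),
    city         = extract_field(line, "city", "[[:alpha:] .-]+"),
    state        = extract_field(line, "state", "[[:alpha:]]+"),
    stars        = as.numeric(extract_field(line, "stars", "[[:digit:].]+")),
    review_count = as.integer(extract_field(line, "review_count", "[[:digit:]]+")),
    stringsAsFactors = FALSE
  )
}

# lapply over every line, then stack the one-row data frames.
businesses <- do.call(rbind, lapply(lines, parse_line))
businesses
```

With a real file, `lines` would presumably come from `readLines()` on the csv. Free-text fields such as name and full_address are deliberately left out here, because their values contain commas and need the more careful treatment shown in the code above.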

And whoever gave you this file deserves whatever evil comes their way in their programmatic life. I'd give them back whatever they wanted from this in gnarly XML with EBCDIC encoding.
