How can you read a CSV file in R with different number of columns

Question

I have a sparse data set in CSV format, in which the number of columns varies from row to row. Here is a sample of the file text.

12223, University
12227, bridge, Sky
12828, Sunset
13801, Ground
14853, Tranceamerica
14854, San Francisco
15595, shibuya, Shrine
16126, fog, San Francisco
16520, California, ocean, summer, golden gate, beach, San Francisco

When I use

read.csv("data.txt", header = F)

R interprets the data set as having 3 columns, because the number of columns is determined from the first 5 rows. Is there any way to force R to put the data in more columns?

My intuition is that specifying the colClasses argument in read.table (with the max number of columns) in combination with fill = TRUE should read the file in.
– Blue Magister, Commented Sep 20, 2013 at 17:38
Could you make a dummy data.frame with 2 rows and the correct number of columns, and then rbind the text file to it?
– John Paul, Commented Sep 20, 2013 at 17:48

Community · Accepted Answer · 2017-05-23 11:46:58Z

Deep in the ?read.table documentation there is the following:

The number of data columns is determined by looking at the first five lines of input (or the whole file if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill or blank.lines.skip are true, so specify col.names if necessary (as in the ‘Examples’).

Therefore, let's define col.names to be length X (where X is the max number of fields in your dataset), and set fill = TRUE:

dat <- textConnection("12223, University
12227, bridge, Sky
12828, Sunset
13801, Ground
14853, Tranceamerica
14854, San Francisco
15595, shibuya, Shrine
16126, fog, San Francisco
16520, California, ocean, summer, golden gate, beach, San Francisco")

read.table(dat, header = FALSE, sep = ",", 
  col.names = paste0("V",seq_len(7)), fill = TRUE)

     V1             V2             V3      V4           V5     V6             V7
1 12223     University                                                          
2 12227         bridge            Sky                                           
3 12828         Sunset                                                          
4 13801         Ground                                                          
5 14853  Tranceamerica                                                          
6 14854  San Francisco                                                          
7 15595        shibuya         Shrine                                           
8 16126            fog  San Francisco                                           
9 16520     California          ocean  summer  golden gate  beach  San Francisco

If the maximum number of fields is unknown, you can use the nifty utility function count.fields (which I found in the read.table example code):

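# pass the file name (or a fresh connection): a connection that has
# already been read to the end has no lines left to count
n_fields <- count.fields("data.txt", sep = ",")
n_fields
# [1] 2 3 2 2 2 2 3 3 7
max(n_fields)
# [1] 7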
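
Putting the two steps together for a file on disk might look like the sketch below. The function name read_ragged_csv and the temporary demo file are illustrative assumptions, not part of the original answer; the sketch relies only on count.fields and read.table from base R, and adds strip.white = TRUE to drop the space after each comma.

```r
# Hypothetical helper: size col.names from the widest line, then read with fill = TRUE
read_ragged_csv <- function(path, sep = ",") {
  # count.fields returns the number of fields found on each line of the file
  n_cols <- max(count.fields(path, sep = sep), na.rm = TRUE)
  read.table(path, header = FALSE, sep = sep, fill = TRUE,
             col.names = paste0("V", seq_len(n_cols)),
             strip.white = TRUE, stringsAsFactors = FALSE)
}

# Demo on a temporary file holding a few lines of the sample data
tmp <- tempfile(fileext = ".txt")
writeLines(c("12223, University",
             "12227, bridge, Sky",
             "16520, California, ocean, summer, golden gate, beach, San Francisco"),
           tmp)
res <- read_ragged_csv(tmp)
dim(res)
unlink(tmp)
```

Because the column count is computed per file, the same helper can be applied to several files whose widths differ.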

Possibly helpful related reading: Only read limited number of columns in R

Roland · Accepted Answer · 2013-09-20 17:39:09Z

You could read the data like this:

dat <- textConnection("12223, University
12227, bridge, Sky
12828, Sunset
13801, Ground
14853, Tranceamerica
14854, San Francisco
15595, shibuya, Shrine
16126, fog, San Francisco
16520, California, ocean, summer, golden gate, beach, San Francisco")

dat <- readLines(dat)
dat <- strsplit(dat, ",")

This results in a list.
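
If a rectangular data frame is wanted rather than a list, one possible follow-up is to pad each element with NA up to the longest length and bind the rows. This is only a sketch under assumptions: the variable names are illustrative, the input is a few lines of the sample data standing in for readLines on the real file, and trimws and lengths require R 3.2 or later.

```r
# A few lines of the sample data, standing in for readLines("data.txt")
lines <- c("12223, University",
           "12227, bridge, Sky",
           "16126, fog, San Francisco")

# Split on commas and trim the whitespace around each field
fields <- lapply(strsplit(lines, ",", fixed = TRUE), trimws)

# Pad every row with NA up to the length of the longest row
n_cols <- max(lengths(fields))
padded <- lapply(fields, function(x) c(x, rep(NA_character_, n_cols - length(x))))

# Stack the padded rows into a character matrix, then convert to a data frame
df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(df) <- paste0("V", seq_len(n_cols))
df$V1 <- as.integer(df$V1)
```

Unlike read.table with fill = TRUE, this version marks missing fields as NA instead of empty strings, and it splits on every comma without regard to quoting.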


3 Comments

CompChemist Over a year ago

The data set that I have is large. I am looking for a solution that doesn't involve copying and pasting the contents of the file. I know I can open the file in Ruby, search for the line with the largest number of commas, and move that line to the first row. I could then open the file in R and all would be solved, but I was hoping for a simple solution in R.

Roland Over a year ago

Well, obviously you would use a file connection (read ?connection). But I don't have access to your file ...

Blue Magister Over a year ago

@CompChemist Put your file object (data.txt) in place of dat. The textConnection was used to quickly read in your example file.

Arun · Accepted Answer · 2013-09-20 17:52:25Z

This does seem to work (following @BlueMagister's suggestion):

tt <- read.table("~/Downloads/tmp.csv", fill=TRUE, header=FALSE, 
          sep=",", colClasses=c("numeric", rep("character", 6)))
names(tt) <- paste("V", 1:7, sep="")

     V1             V2             V3      V4           V5     V6             V7
1 12223     University                                                          
2 12227         bridge            Sky                                           
3 12828         Sunset                                                          
4 13801         Ground                                                          
5 14853  Tranceamerica                                                          
6 14854  San Francisco                                                          
7 15595        shibuya         Shrine                                           
8 16126            fog  San Francisco                                           
9 16520     California          ocean  summer  golden gate  beach  San Francisco


2 Comments

Roland Over a year ago

I've just tried again. This doesn't work if I use the text argument.

Arun Over a year ago

Aha.. so that was the reason.. good to know this difference! Thanks for writing back.

OndroV · Accepted Answer · 2020-03-02 21:02:07Z

I faced a similar challenge, but count.fields from Blue Magister's answer didn't work, probably because commas inside fields conflicted with sep=",". In addition, the number of columns varied from file to file. So I just defined excess col.names in read.table (100 was enough in my case) and then used is.na() to drop the excess columns, which contain only NA.

# read with more column names than any file needs; unused columns are filled with NA
dat <- read.table("path/to/file.csv", col.names = paste0("V", 1:100), fill = TRUE, sep = ",")
# keep every column that has at least one non-NA value (checking all rows, not just the first)
dat <- dat[, colSums(!is.na(dat)) > 0, drop = FALSE]
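
If the miscount comes from commas inside quoted fields, the quote argument of count.fields may be worth checking before giving up on it. The sketch below is an assumption about the cause, not a diagnosis of the file described above: the demo lines are made up, and it compares count.fields, told that only the double quote is a quoting character, with naive splitting on every comma.

```r
# Made-up demo lines: the second field of line 1 is quoted and contains a comma
lines <- c('1,"golden gate, beach",ocean',
           '2,fog')

# Write the lines to a temporary file so count.fields can read them by name
tmp <- tempfile(fileext = ".csv")
writeLines(lines, tmp)

# Treat only the double quote as a quoting character
n_fields <- count.fields(tmp, sep = ",", quote = "\"")
unlink(tmp)

# Naive splitting on every comma, for comparison
n_naive <- lengths(strsplit(lines, ",", fixed = TRUE))
```

Here n_fields counts the quoted field once, while n_naive counts the comma inside it as a separator. Whatever value is passed as quote to count.fields should also be passed to read.table so that both parse the file the same way.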

edited Mar 2, 2020 at 21:02

answered Feb 27, 2020 at 9:11


user12756182 · Accepted Answer · 2020-01-21 17:21:17Z

Try this; it is a bit more dynamic:

readVariableWidthFile <- function(filePath) {
  # read the raw lines and split them on commas to find the widest row
  lines <- readLines(filePath)
  slines <- strsplit(lines, ",")
  colCount <- max(lengths(slines))

  # supply enough column names so read.csv does not guess from the first lines
  read.csv(filePath,
           header = FALSE,
           col.names = paste0("V", seq_len(colCount)),
           fill = TRUE)
}


2 Comments

Sean Pianka Over a year ago

Please add more explanation to your answer. What does your answer add that the currently accepted answer does not?

mikey Over a year ago

I agree this is more dynamic as it allows you to loop through csvs and not need to specify your number of columns
