
I have a list here that looks like this:

head(h)
[[1]]
[1] "gene=dnaA"             "locus_tag=CD630_00010" "location=1..1320"     

[[2]]
character(0)

[[3]]
[1] "locus_tag=CD630_05950"   "location=719777..720313"

[[4]]
[1] "gene=dnrA"             "locus_tag=CD630_00010" "location=50..1320" 

I'm having trouble converting this list into a data.frame with three columns. For elements that are missing the gene entry, I want to fill in "gene=unnamed", and I want to drop the empty elements entirely, giving a matrix like this:

     [,1]           [,2]                    [,3]
[1,] "gene=dnaA"    "locus_tag=CD630_00010" "location=1..1320"
[2,] "gene=unnamed" "locus_tag=CD630_05950" "location=719777..720313"
[3,] "gene=dnrA"    "locus_tag=CD630_00010" "location=50..1320"

This is what I have right now, but I get an error about missing values in the gene column. Any suggestions?

  h <- data.frame(h[lapply(h,length)>0])
  h <- t(h)
  rownames(h) <- NULL

2 Answers

# Data

l <- list(c("gene=dnaA", "locus_tag=CD630_00010", "location=1..1320"),
          character(0),
          c("locus_tag=CD630_05950", "location=719777..720313"),
          c("gene=dnrA", "locus_tag=CD630_00010", "location=50..1320"))

# Manipulation

n <- sapply(l, length)
seq.max <- seq_len(max(n))
# Index every element with 1:max(n) so shorter vectors are padded with NA
df <- t(sapply(l, "[", i = seq.max))
# Move the NAs to the front of each row so the present values are right-aligned
df <- t(apply(df, 1, function(x) {
  c(x[is.na(x)], x[!is.na(x)])}))
# Drop rows that are entirely NA (the empty list elements)
df <- df[rowSums(!is.na(df)) > 0, ]
# Any NA left is a missing leading field (here, the gene)
df[is.na(df)] <- "gene=unnamed"

Output:

     [,1]           [,2]                    [,3]                     
[1,] "gene=dnaA"    "locus_tag=CD630_00010" "location=1..1320"       
[2,] "gene=unnamed" "locus_tag=CD630_05950" "location=719777..720313"
[3,] "gene=dnrA"    "locus_tag=CD630_00010" "location=50..1320"      
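
The approach above relies on position: it assumes the gene is always the missing leading field. Because every string carries its own key before the "=", a possible alternative is to place each value by key. The following is only a sketch under assumptions: the names `keys` and `to_row` are made up for illustration, and `keys` is assumed to be the complete set of keys present in the data.

```r
# Sketch: place each "key=value" string by its key instead of its position.
# `keys` and `to_row` are illustrative names, not taken from the answers here.
l <- list(c("gene=dnaA", "locus_tag=CD630_00010", "location=1..1320"),
          character(0),
          c("locus_tag=CD630_05950", "location=719777..720313"),
          c("gene=dnrA", "locus_tag=CD630_00010", "location=50..1320"))

keys <- c("gene", "locus_tag", "location")   # assumed full set of keys

to_row <- function(x) {
  k <- sub("=.*$", "", x)                     # text before the first "="
  row <- setNames(rep(NA_character_, length(keys)), keys)
  known <- k %in% keys                        # ignore keys outside `keys`
  row[k[known]] <- x[known]
  row
}

m <- do.call(rbind, lapply(l[lengths(l) > 0], to_row))  # skip empty elements
m[is.na(m[, "gene"]), "gene"] <- "gene=unnamed"          # fill missing genes only
m
```

With this layout a missing locus_tag or location would stay NA in its own column rather than shifting the other values, which may or may not be what you want.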

There are a number of ways to bind list elements of unequal length. See bind_rows from dplyr, rbind.fill from plyr, or rbindlist from data.table. Here is a base R approach:

## Sample data
h <- list(letters[1:3],
          character(0),
          letters[4:5])

out <- do.call(rbind, lapply(h, `length<-`, 3))  # fix lengths and make matrix
out <- out[rowSums(!is.na(out))>0, ]             # remove empty rows
out[is.na(out)] <- "gene=unnamed"                # rename NA

data.frame(out)
#   X1 X2           X3
# 1  a  b            c
# 2  d  e gene=unnamed

8 Comments

In your answer, everything seems to be pushed to the left when you are fixing the number of columns. How would you push everything to the right if you want the NA values to be in X1?
@Chani yes, that is a problem because the lists aren't named, so it is ambiguous which column they belong to when there are missing values. To always push right try do.call(rbind, lapply(h, function(x) rev(`length<-`(x, 3))))
I tried looking into rbindlist, as it is much faster on large lists. I'm trying rbindlist(lapply(h, function(x) rev(`length<-`(x, 3)))), but I keep getting the error "Item 1 of list input is not a data.frame, data.table or list". When I check the class of lapply(h, function(x) rev(`length<-`(x, 3))), it returns list.
yea it should be fast. try rbindlist(lapply(h, function(x) as.list(rev(`length<-`(x, 3)))))
Haha, now it completely reverses the order of the columns. It now becomes location | locus | gene instead of gene | locus | location while correctly pushing everything to the right
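
The column reversal noted in the last comment comes from rev(), which flips the values as well as the padding. One possible way around it, sketched here in base R, is to prepend the NAs instead of appending and reversing. The helper name `pad_left` is hypothetical, and it assumes no element is longer than `n`.

```r
## Sample data (same as in the answer)
h <- list(letters[1:3],
          character(0),
          letters[4:5])

# Hypothetical helper: put the NA padding in front, keep the values in order.
# Assumes length(x) <= n.
pad_left <- function(x, n) c(rep(NA_character_, n - length(x)), x)

out <- do.call(rbind, lapply(h, pad_left, n = 3))      # right-aligned matrix
out <- out[rowSums(!is.na(out)) > 0, , drop = FALSE]   # remove empty rows
out
```

Here the second remaining row is NA, "d", "e", so the original left-to-right order is preserved. Wrapping each padded vector in as.list(), as suggested in the comments, should let the same helper feed rbindlist.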