Count the number of each character in a string, by column position

Question

I have a text file like this:

V1 V2   V3
X  N    aaaaaabbbabab
C  T    ababaaabaaabb
V  H    babbbabaabbba

What I want to do is count how many a's and how many b's there are at each character position (column) of V3, across all rows.

So the output would be like this:

   col1  col2 col3 .......  col13
a  2     2    2             1
b  1     1    1             2

How can this be done?

I tried the count function along with substring, but it did not work.

Thanks

Gavin Simpson · Accepted Answer · 2011-05-24 13:15:36Z

4

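Assuming dat contains your data, we process V3 using strsplit() to split each string into single characters, then arrange them in a matrix with one row per string: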

tt <- matrix(unlist(strsplit(dat$V3, split = "")), ncol = 13, byrow = TRUE)

giving:

> tt
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] "a"  "a"  "a"  "a"  "a"  "a"  "b"  "b"  "b"  "a"   "b"   "a"   "b"  
[2,] "a"  "b"  "a"  "b"  "a"  "a"  "a"  "b"  "a"  "a"   "a"   "b"   "b"  
[3,] "b"  "a"  "b"  "b"  "b"  "a"  "b"  "a"  "a"  "b"   "b"   "b"   "a"
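The call above hardcodes ncol = 13 and relies on V3 being a character vector. A minimal sketch that avoids both assumptions is shown here; the dat object is a hypothetical stand-in for the question's data, with V3 deliberately stored as a factor to show the conversion.

```r
# Hypothetical stand-in for the question's data; V3 is a factor on purpose
dat <- data.frame(
  V1 = c("X", "C", "V"),
  V2 = c("N", "T", "H"),
  V3 = c("aaaaaabbbabab", "ababaaabaaabb", "babbbabaabbba"),
  stringsAsFactors = TRUE
)

v3 <- as.character(dat$V3)                 # strsplit() needs a character vector
stopifnot(length(unique(nchar(v3))) == 1)  # all strings must share one length

# One row per string, one column per character position
tt <- do.call(rbind, strsplit(v3, split = ""))
dim(tt)
```

Using do.call(rbind, ...) lets the number of columns follow from the data instead of being typed in by hand.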

We can get the desired result as follows, taking care to set the levels explicitly:

apply(tt, 2, function(x) c(table(factor(x, levels = c("a","b")))))

which gives:

> apply(tt, 2, function(x) c(table(factor(x, levels = c("a","b")))))
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
a    2    2    2    1    2    3    1    1    2     2     1     1     1
b    1    1    1    2    1    0    2    2    1     1     2     2     2
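To see why the explicit levels matter, here is a small sketch comparing the two calls on the same matrix. Position 6 contains only "a", which is the case that trips up a plain table().

```r
tt <- do.call(rbind, strsplit(c("aaaaaabbbabab", "ababaaabaaabb", "babbbabaabbba"), ""))

# Without fixed levels, position 6 (all "a") yields a table of length 1,
# so the per-column results differ in length and apply() returns a list.
raw <- apply(tt, 2, table)
is.list(raw)
lengths(raw)[6]

# With fixed levels, every column yields a count for both "a" and "b",
# including zeros, so apply() can return a 2 x 13 matrix.
fixed <- apply(tt, 2, function(x) c(table(factor(x, levels = c("a", "b")))))
fixed["b", 6]
```

Binding the ragged list with cbind() would recycle the short element, which is how a zero count can silently turn into a wrong non-zero one.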

To automate the selection of appropriate levels, we could do something like:

> lev <- levels(factor(tt))
> apply(tt, 2, function(x, levels) c(table(factor(x, levels = levels))), 
+       levels = lev)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
a    2    2    2    1    2    3    1    1    2     2     1     1     1
b    1    1    1    2    1    0    2    2    1     1     2     2     2

where in the first line we treat tt as a vector, and extract the levels after temporarily converting tt to a factor. We then supply these levels (lev) to the apply() step, instead of stating the levels explicitly.
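As an alternative that is not part of the original answer, the levels can also be handled implicitly by cross-tabulating every character against its column index in a single table() call. This is only a sketch; the names char and pos are labels chosen for illustration.

```r
tt <- do.call(rbind, strsplit(c("aaaaaabbbabab", "ababaaabaaabb", "babbbabaabbba"), ""))

# col(tt) gives the column index of every cell; flattening both with c()
# pairs each character with its position. A two-way table() then fills in
# zero counts for combinations that never occur.
counts <- table(char = c(tt), pos = c(col(tt)))
counts["b", "6"]
```

Because the table is built from all characters at once, any letter that appears anywhere in the data gets a row, even where its count at a given position is zero.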

edited May 24, 2011 at 13:15

answered May 24, 2011 at 12:54

Gavin Simpson

9 Comments

smack Over a year ago

@Gavin Simpson: can you please explain what you did?

Gavin Simpson Over a year ago

@smack It is the same as @Joris to get tt. The difference is in how I use table(). It is important to get table() to count both "a" and "b" even when one of them is missing. The way to do this is to set the levels explicitly to c("a","b"). Is that sufficient or should I try to explain more?

smack Over a year ago

No, I think that's sufficient. But in case I want to add a 3rd variable to the list, maybe "c", I can just add it to the levels, right? And what can I use to plot this kind of data frame?

Gavin Simpson Over a year ago

@smack actually, we can simplify the last step because apply() returns a matrix if we get the counting in table() correct.

Gavin Simpson Over a year ago

@smack yup, just add "c" to the list of levels. If there are a lot of them, we could automate that step too, to select the correct levels.

Joris Meys · Accepted Answer · 2011-05-24 15:06:14Z

2

EDIT: solution corrected after comments from Gavin Simpson. This works now.

To avoid many conversions to factor, you can use the following trick with the indices and tapply:

tt <- c("aaaaaabbbabab","ababaaabaaabb","babbbabaabbba")

ttstr <- strsplit(tt,"")
ttf <- factor(unlist(ttstr))
n <- length(ttstr[[1]])
k <- length(ttstr)

> do.call(cbind,tapply(ttf,rep(1:n,k),table))
  1 2 3 4 5 6 7 8 9 10 11 12 13
a 2 2 2 1 2 3 1 1 2  2  1  1  1
b 1 1 1 2 1 0 2 2 1  1  2  2  2
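The grouping index is the key step here, so the following sketch spells it out. The variable pos is a name introduced for illustration; everything else follows the code above.

```r
tt <- c("aaaaaabbbabab", "ababaaabaaabb", "babbbabaabbba")
ttstr <- strsplit(tt, "")
n <- length(ttstr[[1]])   # characters per string
k <- length(ttstr)        # number of strings

# unlist() concatenates the characters string by string, so the position
# labels 1..n simply repeat k times.
pos <- rep(seq_len(n), times = k)

# Converting once to a factor means every group keeps both levels,
# so a position with no "b" still reports a zero.
ttf <- factor(unlist(ttstr))

head(data.frame(char = ttf, pos = pos), 3)
counts <- do.call(cbind, tapply(ttf, pos, table))
counts["b", "6"]
```

The single factor() call is what makes this safe for position 6: table() on a subset of a factor still reports all of its levels.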

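This is nearly 8 times faster than the method shown by @Gavin (assuming method1 is the tapply() approach above and method2 is @Gavin's apply() approach; their definitions are not shown):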

> benchmark(method1(tt),method2(tt),replications=1)
         test replications elapsed relative user.self 
1 method1(tt)            1    0.89 1.000000      0.89   
2 method2(tt)            1    6.99 7.853933      6.98
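The definitions of method1() and method2() are not shown, so the wrappers below are hypothetical reconstructions under different names, intended only as a sketch of how such a comparison could be set up. It checks that both approaches agree before timing them with base R's system.time().

```r
tt <- c("aaaaaabbbabab", "ababaaabaaabb", "babbbabaabbba")

# Hypothetical wrapper around the tapply() approach
method_tapply <- function(x) {
  parts <- strsplit(x, "")
  n <- length(parts[[1]])
  idx <- rep(seq_len(n), times = length(parts))
  do.call(cbind, tapply(factor(unlist(parts)), idx, table))
}

# Hypothetical wrapper around the apply() approach with explicit levels
method_apply <- function(x) {
  m <- do.call(rbind, strsplit(x, ""))
  lev <- levels(factor(m))
  apply(m, 2, function(column) c(table(factor(column, levels = lev))))
}

res1 <- method_tapply(tt)
res2 <- method_apply(tt)
all(res1 == res2)   # the two approaches should give the same counts

# Timings depend on machine, R version, and input size
system.time(for (i in 1:100) method_tapply(tt))
system.time(for (i in 1:100) method_apply(tt))
```

Three short strings are too small for a meaningful comparison; a larger input would be needed to see a stable ratio.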

edited May 24, 2011 at 15:06

answered May 24, 2011 at 12:39

Joris Meys

6 Comments

smack Over a year ago

@Joris Meys: both methods work, but both of them gave a warning: Warning message: In function (..., deparse.level = 1) : number of rows of result is not a multiple of vector length (arg 1)

Joris Meys Over a year ago

@smack : then the data you gave is not the same as the one you have, as I don't get the warnings when I replace tt with df$V3. Which line gives you the warning?

smack Over a year ago

@Joris Meys: OK, I gave you a representative example of the data, and thanks, it worked. But when I transform it to a data frame, the numbers are gone and replaced by a and b. I want to plot these numbers in a graph (x-axis: column number (1..13), y-axis: number of a and b). How can I transform it without losing the numbers? Sorry for asking a lot, but I'm new to R.

Gavin Simpson Over a year ago

@Joris Actually, both are wrong. I came up with the same solution as your matrix one, and then realised that for "column" 6, which contains only "a" you get the wrong answer. Look at your results, it counts 3 "b"s as well as 3 "a"s, which can't be right - R is silently expanding the count for a. You need to set the correct levels in the table() call as per my Answer.

Joris Meys Over a year ago

@smack : Something is going wrong in your code, do not neglect it. If you get a warning, it doesn't work. Especially if you transform it to a dataframe, you get numbers.

Sacha Epskamp · Accepted Answer · 2011-05-24 12:57:17Z

0

Here is a new version to answer the actual question. It still uses gregexpr, but this time works with the match positions. I have to go out of my way a bit to account for zero-count cells (which I can't get from table?)

foo <- data.frame(
    V1 = c("X","C","V"),
    V2 = c("N","T","H"),
    V3 = c("aaaaaabbbabab","ababaaabaaabb","babbbabaabbba"))

n <- nchar(as.character(foo$V3)[1])
tabA <- table(unlist(gregexpr("a",foo$V3)),exclude=-1)
tabB <- table(unlist(gregexpr("b",foo$V3)),exclude=-1)

res <- matrix(0,2,n)

res[1,as.numeric(names(tabA))] <- tabA
res[2,as.numeric(names(tabB))] <- tabB

rownames(res) <- c("a","b")
res
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
a    2    2    2    1    2    3    1    1    2     2     1     1     1
b    1    1    1    2    1    0    2    2    1     1     2     2     2

Without zerocount cells you could simply do rbind(tabA,tabB).
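One way to get the zero-count cells directly, sketched here as a variation rather than as part of the original answer, is to replace table() with tabulate(), which counts into a fixed number of bins and ignores non-positive values such as the -1 that gregexpr() returns for "no match". The name wanted is introduced for illustration.

```r
v3 <- c("aaaaaabbbabab", "ababaaabaaabb", "babbbabaabbba")
n <- nchar(v3[1])
wanted <- c("a", "b")   # hypothetical: the characters to count

res <- t(sapply(wanted, function(ch) {
  hits <- unlist(gregexpr(ch, v3, fixed = TRUE))  # match positions; -1 if none
  tabulate(hits, nbins = n)                       # bins 1..n, zeros included
}))
res
```

Because sapply() names its result after the character vector it iterates over, the rows of res are already labelled "a" and "b", and extending wanted with another character adds a row without further changes.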

edited May 24, 2011 at 12:57

answered May 24, 2011 at 12:39

Sacha Epskamp

2 Comments

Joris Meys Over a year ago

That's not what OP is looking for... He's looking at columnwise comparisons.

smack Over a year ago

I think you got it wrong. I want to count a and b by column of the substring of V3, not by row.
