
I have a text file like this:

V1 V2   V3
X  N    aaaaaabbbabab
C  T    ababaaabaaabb
V  H    babbbabaabbba

What I want to do is count how many a's and how many b's there are at each character position (column) of V3, across all rows.

So the output would be like this:

   col1  col2 col3 .......  col13
a  2     2    2             1
b  1     1    1             2

How can this be done?

I tried the count function along with substring, but it did not work.

Thanks


3 Answers


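Assuming dat contains your data, we use strsplit() to split each string in V3 into single characters and arrange them in a matrix with one row per string (as.character() guards against V3 having been read in as a factor):

tt <- matrix(unlist(strsplit(as.character(dat$V3), split = "")), ncol = 13, byrow = TRUE)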

giving:

> tt
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] "a"  "a"  "a"  "a"  "a"  "a"  "b"  "b"  "b"  "a"   "b"   "a"   "b"  
[2,] "a"  "b"  "a"  "b"  "a"  "a"  "a"  "b"  "a"  "a"   "a"   "b"   "b"  
[3,] "b"  "a"  "b"  "b"  "b"  "a"  "b"  "a"  "a"  "b"   "b"   "b"   "a"

We can get the desired result as follows, taking care to set the factor levels explicitly so that a letter that is absent from a column is still counted as zero:

apply(tt, 2, function(x) c(table(factor(x, levels = c("a","b")))))

which gives:

> apply(tt, 2, function(x) c(table(factor(x, levels = c("a","b")))))
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
a    2    2    2    1    2    3    1    1    2     2     1     1     1
b    1    1    1    2    1    0    2    2    1     1     2     2     2
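To see why the explicit levels matter, here is a minimal sketch using column 6 of the example, which contains only "a". The variable names (col6, no_levels, with_levels) are illustrative and not part of the answer. Without levels, table() reports only the values that occur, so the per-column results have different lengths and apply() cannot assemble them into a matrix.

```r
# Column 6 of the example data contains only "a"
col6 <- c("a", "a", "a")

# Without explicit levels, table() only reports values that actually occur,
# so "b" is missing from the result altogether
no_levels <- table(col6)

# With explicit levels, the absent "b" is kept as a zero count
with_levels <- table(factor(col6, levels = c("a", "b")))

no_levels
with_levels
```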

To automate the selection of appropriate levels, we could do something like:

> lev <- levels(factor(tt))
> apply(tt, 2, function(x, levels) c(table(factor(x, levels = levels))), 
+       levels = lev)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
a    2    2    2    1    2    3    1    1    2     2     1     1     1
b    1    1    1    2    1    0    2    2    1     1     2     2     2

where in the first line we treat tt as a vector, and extract the levels after temporarily converting tt to a factor. We then supply these levels (lev) to the apply() step, instead of stating the levels explicitly.
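The steps above could be wrapped into one function along these lines. This is only a sketch: count_by_position is a hypothetical name, and it assumes all strings have the same length, deriving the number of columns from the data instead of hard-coding 13.

```r
# Hypothetical helper (the name is illustrative, not from the answer)
count_by_position <- function(strings, levels = NULL) {
  strings <- as.character(strings)  # in case the column is a factor
  # Assumption: every string has the same number of characters
  stopifnot(length(unique(nchar(strings))) == 1)
  # One row per string, one column per character position
  chars <- do.call(rbind, strsplit(strings, split = ""))
  # Derive the levels from the data unless they are given explicitly
  if (is.null(levels)) levels <- sort(unique(as.vector(chars)))
  apply(chars, 2, function(x) c(table(factor(x, levels = levels))))
}

v3 <- c("aaaaaabbbabab", "ababaaabaaabb", "babbbabaabbba")
counts <- count_by_position(v3)
counts
```

Passing levels = c("a", "b", "c") would add a row of zero counts for "c" when applied to this data.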


9 Comments

@Gavin Simpson: can you please explain what you did?
@smack It is the same as @Joris to get tt. The difference is in how I use table(). It is important to get table() to count both "a" and "b" even when one of them is missing. The way to do this is to set the levels explicitly to c("a","b"). Is that sufficient or should I try to explain more?
No, I think that's sufficient. But if I want to add a third value to the list, maybe "c", I can just add it to the levels, right? And what can I use to plot this kind of data frame?
@smack actually, we can simplify the last step because apply() returns a matrix if we get the counting in table() correct.
@smack yup, just add "c" to the list of levels. If there are a lot of them, we could automate that step too, to select the correct levels.

EDIT: solution corrected after comments from Gavin Simpson. This works now.


To avoid many conversions to factor, you can use the following trick with the indices and tapply():

tt <- c("aaaaaabbbabab","ababaaabaaabb","babbbabaabbba")

ttstr <- strsplit(tt,"")
ttf <- factor(unlist(ttstr))
n <- length(ttstr[[1]])
k <- length(ttstr)

> do.call(cbind,tapply(ttf,rep(1:n,k),table))
  1 2 3 4 5 6 7 8 9 10 11 12 13
a 2 2 2 1 2 3 1 1 2  2  1  1  1
b 1 1 1 2 1 0 2 2 1  1  2  2  2

This gives a speedup of about 7 times over the method shown by @Gavin:

> benchmark(method1(tt),method2(tt),replications=1)
         test replications elapsed relative user.self 
1 method1(tt)            1    0.89 1.000000      0.89   
2 method2(tt)            1    6.99 7.853933      6.98     

6 Comments

@Joris Meys: both of the methods work, but both of them gave a warning: Warning message: In function (..., deparse.level = 1) : number of rows of result is not a multiple of vector length (arg 1)
@smack : then the data you gave is not the same as the one you have, as I don't get the warnings when I replace tt with df$V3. Which line gives you the warning?
@Joris Meys: OK, I gave you a representative example of the data. Thanks, it worked, but when I transform it to a data frame the numbers are gone and replaced by a and b. I want to plot these numbers in a graph (x-axis: column number 1..13, y-axis: number of a and b). How can I transform it without losing the numbers? Sorry for asking a lot, but I'm new to R.
@Joris Actually, both are wrong. I came up with the same solution as your matrix one, and then realised that for "column" 6, which contains only "a" you get the wrong answer. Look at your results, it counts 3 "b"s as well as 3 "a"s, which can't be right - R is silently expanding the count for a. You need to set the correct levels in the table() call as per my Answer.
@smack : Something is going wrong in your code, do not neglect it. If you get a warning, it doesn't work. Especially if you transform it to a dataframe, you get numbers.
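On the data frame question in the comments above, one possible conversion that keeps the counts numeric is sketched below. The matrix values are copied from the output shown in the answer; the names res and wide are illustrative assumptions.

```r
# Count matrix with the values shown in the answer's output
res <- rbind(
  a = c(2, 2, 2, 1, 2, 3, 1, 1, 2, 2, 1, 1, 1),
  b = c(1, 1, 1, 2, 1, 0, 2, 2, 1, 1, 2, 2, 2)
)

# Transpose so each position becomes a row, then add the position index;
# the columns a and b stay numeric
wide <- data.frame(position = seq_len(ncol(res)), t(res))
wide
```

For a quick plot, something like barplot(res, beside = TRUE, names.arg = seq_len(ncol(res))) works directly on the matrix, with no data frame needed.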

Here is a new version to answer the actual question. It still uses gregexpr(), but this time works with the match positions. I have to go out of my way a bit to account for zero-count cells (which I can't get from table()?).

foo <- data.frame(
    V1 = c("X","C","V"),
    V2 = c("N","T","H"),
    V3 = c("aaaaaabbbabab","ababaaabaaabb","babbbabaabbba"))

n <- nchar(as.character(foo$V3)[1])
tabA <- table(unlist(gregexpr("a",foo$V3)),exclude=-1)
tabB <- table(unlist(gregexpr("b",foo$V3)),exclude=-1)

res <- matrix(0,2,n)

res[1,as.numeric(names(tabA))] <- tabA
res[2,as.numeric(names(tabB))] <- tabB

rownames(res) <- c("a","b")
res
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
a    2    2    2    1    2    3    1    1    2     2     1     1     1
b    1    1    1    2    1    0    2    2    1     1     2     2     2

Without zero-count cells you could simply do rbind(tabA, tabB).
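As a possible variation on the same idea, tabulate() with nbins keeps zero-count positions and ignores the -1 that gregexpr() returns when there is no match, so the manual filling of the matrix can be skipped. This is a sketch under the assumption that all strings have the same length; v3 and letters_wanted are illustrative names.

```r
v3 <- c("aaaaaabbbabab", "ababaaabaaabb", "babbbabaabbba")
n <- nchar(v3[1])              # assumes all strings have equal length
letters_wanted <- c("a", "b")

# For each letter, collect the match positions over all strings and
# count how often each position 1..n occurs; values outside 1..n
# (such as the -1 for "no match") are ignored by tabulate()
res <- t(sapply(letters_wanted, function(ch) {
  pos <- unlist(gregexpr(ch, v3, fixed = TRUE))
  tabulate(pos, nbins = n)
}))
res
```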

2 Comments

That's not what OP is looking for... He's looking at columnwise comparisons.
I think you got it wrong: I want to count a and b by column of the substring of V3, not by row.
