Finding and counting strings using multiple index vectors

Question

I have a character array, list (it can also be stored as a cell array if that is more useful), and wish to tally how often each letter occurs against two different indexes held in two separate variables, type and ind.

list =
C C N N C U C N N N C N U N C N C

ind =
1 1 2 2 2 3 3 3 4 1 1 2 3 3 3 4 4 

type = 
15 15 15 15 15 15 15 15 15 16 16 16 16 16 16 16 16

No spaces exist in the character array - added for clarity.

Using the above example, the desired output would tally all instances of each unique letter in list, for each ind and for each type. This gives three columns (for C/N/U), each with 4 rows (one per ind), per type. The arrays correspond element by element: the k-th letter in list belongs to the k-th entries of ind and type.

Desired output of above example (the labels are added for clarity only):

            Type 15              Type 16
   Ind  C      N      U      C      N      U
    1   2      0      0      1      1      0
    2   1      2      0      0      1      0
    3   1      1      1      1      1      1
    4   0      1      0      1      1      0

I am only aware of how to do this with a single index (using unique, full and sparse).

How can I best go about doing this with a dual index?

No - as shown in this example, it always contains three (N, C and U).
– AnnaSchumann, Commented Aug 22, 2015 at 10:01
@AnnaSchumann I added a solution using crosstab, which seems the most appropriate to me.
– Robert Seifert, Commented Aug 22, 2015 at 10:46

Robert Seifert · Accepted Answer · 2015-08-22 10:49:54Z

One possibility is to convert your letters to doubles by subtracting 64, which maps the letter C to the number 3 (N becomes 14 and U becomes 21).

Then you can use unique with 'rows' and 'stable', to get the following result:

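list = 'CCNNCUCNNNCNUNCNC';
ind = [1 1 2 2 2 3 3 3 4 1 1 2 3 3 3 4 4];
type = [15 15 15 15 15 15 15 15 15 16 16 16 16 16 16 16 16];

%// one row per observation: [type, ind, letter code]
data = [type(:) ind(:) (list(:) - 64)];
%// a: distinct rows in order of first appearance, c: index into a for each observation
[a,~,c] = unique(data,'rows','stable');
%// number of observations that fall on each distinct row
occ = accumarray(c,1);

output = [a, occ]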

output =

    15     1     3     2
    15     2    14     2
    15     2     3     1
    15     3    21     1
    15     3     3     1
    15     3    14     1
    15     4    14     1
    16     1    14     1
    16     1     3     1
    16     2    14     1
    16     3    21     1
    16     3    14     1
    16     3     3     1
    16     4    14     1
    16     4     3     1
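
The long-format table above can be pivoted into the wide layout from the question without any toolbox. The following is only a sketch under a few assumptions: the variable names (ti, ii, li, col, wide, colLetters) are made up for illustration, and the rows are rebuilt here with the default sorted order of unique, which should not affect the pivot.

```matlab
list = 'CCNNCUCNNNCNUNCNC';
ind  = [1 1 2 2 2 3 3 3 4 1 1 2 3 3 3 4 4];
type = [15 15 15 15 15 15 15 15 15 16 16 16 16 16 16 16 16];

% Long format: a holds [type, ind, letter code], occ the matching counts
[a, ~, c] = unique([type(:), ind(:), double(list(:)) - 64], 'rows');
occ = accumarray(c, 1);

% Number the distinct values of each key column 1..K
[types, ~, ti] = unique(a(:,1));   % 15, 16
[inds,  ~, ii] = unique(a(:,2));   % 1, 2, 3, 4
[codes, ~, li] = unique(a(:,3));   % 3 (C), 14 (N), 21 (U)

% One block of numel(codes) columns per type: [C N U | C N U]
col  = (ti - 1) * numel(codes) + li;
wide = accumarray([ii, col], occ, [numel(inds), numel(types) * numel(codes)]);

colLetters = char(codes.' + 64);   % letter order inside each block
```

Passing the size explicitly to accumarray keeps combinations that never occur as zeros, which is what the desired table shows for, e.g., N under ind 1 and type 15.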

If you have the Statistics Toolbox you should consider using grpstats.

If you don't mind a slightly mind-twisting output, then crosstab is by far the easiest solution:

output = crosstab(type(:),ind(:),list(:)-64)

%// type runs downwards, ind to the right
output(:,:,1) =   %// 'C'

     2     1     1     0
     1     0     1     1


output(:,:,2) =   %// 'N'

     0     2     1     1
     1     1     1     1


output(:,:,3) =  %// 'U'

     0     0     1     0
     0     0     1     0

The following one-liner comes close to your desired output:

output2 = reshape(crosstab(ind(:),list(:)-64,type(:)),4,[],1)

output2 =

     2     0     0     1     1     0
     1     2     0     0     1     0
     1     1     1     1     1     1
     0     1     0     1     1     0

Also in this toolbox, you can find the tabulate function which offers another option in combination with accumarray:

[~,~,c] = unique([type(:) ind(:)],'rows','stable')
output = accumarray(c(:),list(:),[],@(x) {tabulate(x)} )

Which also allows the following output:

d = unique([type(:) ind(:) list(:)-64],'rows','stable')
output2 = [num2cell(d(:,[1,2])) vertcat(output{:})]

output2 = 

    [15]    [1]    'C'    [2]    [    100]
    [15]    [2]    'N'    [2]    [66.6667]
    [15]    [2]    'C'    [1]    [33.3333]
    [15]    [3]    'U'    [1]    [33.3333]
    [15]    [3]    'C'    [1]    [33.3333]
    [15]    [3]    'N'    [1]    [33.3333]
    [15]    [4]    'N'    [1]    [    100]
    [16]    [1]    'N'    [1]    [     50]
    [16]    [1]    'C'    [1]    [     50]
    [16]    [2]    'N'    [1]    [    100]
    [16]    [3]    'U'    [1]    [33.3333]
    [16]    [3]    'N'    [1]    [33.3333]
    [16]    [3]    'C'    [1]    [33.3333]
    [16]    [4]    'N'    [1]    [     50]
    [16]    [4]    'C'    [1]    [     50]
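
The percentage column that tabulate reports can also be derived from a plain count matrix. As a rough sketch, assuming the 4-by-6 matrix printed by the crosstab one-liner above is available (it is typed in by hand here, and the names counts, totals and pct are illustrative only):

```matlab
% Counts as printed above: rows are ind 1..4, columns are C N U for
% type 15 followed by C N U for type 16
wide = [2 0 0 1 1 0
        1 2 0 0 1 0
        1 1 1 1 1 1
        0 1 0 1 1 0];

% Split the two column blocks into pages: ind x letter x type
counts = reshape(wide, 4, 3, 2);

% Observations per (ind, type) group, summed over the letters
totals = sum(counts, 2);

% Share of each letter within its group, in percent
pct = 100 * bsxfun(@rdivide, counts, totals);
```

For ind 2 and type 15 this gives roughly 33.3 for C and 66.7 for N, in line with the tabulate output. A group with no observations at all would produce NaN because of the division by zero; that case does not occur in this example.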

Brilliant answer. +1 for crosstab and the one-liner. Very concise and simple. Thank you very much.

Adriaan · Answer · 2015-08-22 08:41:04Z

Use accumarray:

Output = accumarray([type',ind'],list');

You may need to convert type and list to numbers first using str2num, then apply accumarray, and convert the result back to characters using num2str.

AnnaSchumann Over a year ago

I'm having problems implementing this approach due to the data types at hand. I've tried to simplify this by using simple vectors for ind and type that contain purely numerical data. However 'list' must either be a cell array or character array. str2num returns an empty cell array when I attempt to convert it for use with accumarray.
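
The data-type problem described in this comment is probably because str2num expects text that represents numbers, which letters such as 'C' do not. One way around it, sketched here under the assumption that ind already holds positive integers as in the question, is to let the third output of unique number the letters and the types. The names letterIdx, typeIdx, counts, type15 and type16 are illustrative only.

```matlab
list = 'CCNNCUCNNNCNUNCNC';
ind  = [1 1 2 2 2 3 3 3 4 1 1 2 3 3 3 4 4];
type = [15 15 15 15 15 15 15 15 15 16 16 16 16 16 16 16 16];

% The third output of unique gives each element the position of its value
% in the sorted list of distinct values: C -> 1, N -> 2, U -> 3
[letters, ~, letterIdx] = unique(list(:));
[types,   ~, typeIdx]   = unique(type(:));   % 15 -> 1, 16 -> 2

% Subscripts are (ind, letter, type); each observation adds 1 to its cell
counts = accumarray([ind(:), letterIdx(:), typeIdx(:)], 1);

type15 = counts(:,:,1);   % rows ind 1..4, columns C N U
type16 = counts(:,:,2);
```

Since unique also accepts a cell array of character vectors, the same approach should carry over if list is stored as a cell array instead of a character array.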
