Fastest way to find unique values in an array

Question

I'm trying to find a fastest way for finding unique values in a array and to remove 0 as a possibility of unique value.

Right now I have two solutions:

result1 = setxor(0, dataArray(1:end,1)); % This gives the correct solution
result2 = unique(dataArray(1:end,1)); % This solution is faster but doesn't give the same result as result1

dataArray is equivalent to :

dataArray = [0 0; 0 2; 0 4; 0 6; 1 0; 1 2; 1 4; 1 6; 2 0; 2 2; 2 4; 2 6]; % This is a small array, but in my case there are usually over 10 000 lines.

So in this case, result1 is equal to [1; 2] and result2 is equal to [0; 1; 2]. The unique function is faster but I don't want 0 to be considered. Is there a way to do this with unique and not consider 0 as a unique value? Is there an another alternative?

EDIT

I wanted to time the various solutions.

clc
dataArray = floor(10*rand(10e3,10));
dataArray(mod(dataArray(:,1),3)==0)=0;
% Initial
tic
for ii = 1:10000
   FCT1 = setxor(0, dataArray(:,1));
end
toc
% My solution
tic
for ii = 1:10000
   FCT2 = unique(dataArray(dataArray(:,1)>0,1));
end
toc
% Pursuit solution
tic
for ii = 1:10000
   FCT3 = unique(dataArray(:, 1));
   FCT3(FCT3==0) = [];
end
toc
% Pursuit solution with chappjc comment
tic
for ii = 1:10000
   FCT32 = unique(dataArray(:, 1));
   FCT32 = FCT32(FCT32~=0);
end
toc
% chappjc solution
tic
for ii = 1:10000
   FCT4 = setdiff(unique(dataArray(:,1)),0);
end
toc
% chappjc 2nd solution
tic
for ii = 1:10000
   FCT5 = find(accumarray(dataArray(:,1)+1,1))-1;
   FCT5 = FCT5(FCT5>0);
end
toc

And the results:

Elapsed time is 5.153571 seconds. % FCT1 Initial
Elapsed time is 3.837637 seconds. % FCT2 My solution
Elapsed time is 3.464652 seconds. % FCT3 Pursuit solution
Elapsed time is 3.414338 seconds. % FCT32 Pursuit solution with chappjc comment
Elapsed time is 4.097164 seconds. % FCT4 chappjc solution
Elapsed time is 0.936623 seconds. % FCT5 chappjc 2nd solution

However, the solution with sparse and accumarray only works with integer. These solutions won't work with double.

@chappjc you are right, the 'rows' is not necessary. I removed it. — m_power
– m_power, Commented Dec 18, 2013 at 20:44
You might want to try the timings for longer dataArray rather than more iterations. But just for kicks, throw in FCT3 = FCT3(FCT3~=0); since that is usually faster than assigning [];. — chappjc
– chappjc, Commented Dec 18, 2013 at 20:48
Agree with @chappjc - your timing is dominated by having either one or two function calls. Try this on a "big" array (see my answer for an example). You are mostly timing the loop (and function call) overhead with your current code, rather than the efficiency of the functions once you are inside them. — Floris
– Floris, Commented Dec 18, 2013 at 20:54
@chappjc 2nd solution appears to be the fastest, for now :D. — m_power
– m_power, Commented Dec 18, 2013 at 21:20
@chappjc inspired by your last comment I updated my answer one more time... — Floris
– Floris, Commented Dec 18, 2013 at 22:31

Community · Accepted Answer · 2017-05-23 12:08:09Z

6

Here's a wacky suggestion with accumarray, demonstrated using Floris' test data:

a = floor(10*rand(100000, 1)); a(mod(a,3)==0)=0;
result = find(accumarray(nonzeros(a(:,1))+1,1))-1;

Thanks to Luis Mendo for pointing out that with nonzeros, it is not necessary to perform result = result(result>0)!

Note that this solution requires integer-valued data (not necessarily an integer data type, but just not with decimal components). Comparing floating point values for equality, as unique would do, is perilous. See here and here.

Original suggestion: Combine unique with setdiff:

result = setdiff(unique(a(:,1)),0)

Or remove with logical indexing after unique:

result = unique(a(:,1));
result = result(result>0);

I generally prefer not to assign [] as in (result(result==0)=[];) since it gets very inefficient for large data sets.

Removing zeros after unique should be faster since the it operates on less data (unless every element is unique, OR if a/dataArray is very short).

edited May 23, 2017 at 12:08

CommunityBot

11 silver badge

answered Dec 18, 2013 at 20:36

chappjc

30.6k7 gold badges79 silver badges138 bronze badges

Sign up to request clarification or add additional context in comments.

8 Comments

m_power Over a year ago

When using a 10000x10 I get the following error : ??? Error using ==> accumarray Maximum variable size allowed by the program is exceeded.

chappjc Over a year ago

@m_power Needs to use just the first column (a(:,1)) like the other solutions. Updated.

Floris Over a year ago

Wacky, maybe. But FAST!

chappjc Over a year ago

@m_power I'd reconsider the problem based on what the expected values are. Are they continuous or will they be certain values. There are problems using unique on double as well although it will let you run it. Beware.

Luis Mendo Over a year ago

@chappjc +1, but why not remove the zero elements before accumarray?: result = find(accumarray(nonzeros(a(:,1))+1,1)-1);. It takes slighlty less time on my computer.

|

Floris · Accepted Answer · 2013-12-18 22:31:15Z

5

Just to add to the general clamor - here are three different methods. They all give the same answer, but slightly different timings:

a = floor(10*rand(100000, 1));
a(mod(a,3)==0)=0;
tic
b1 = unique(a(:,1));
b1(b1==0) = [];
toc
tic
b2 = find(sparse(a(:,1)+1, 1, 1)) - 1;
b2(b2==0)=[];
toc
tic
b3 = setxor(0, a(:, 1), 'rows');
toc

display(b1)
display(b2)
display(b3)

On my machine, the timings (for an array of 100000 elements) were as follows:

0.0087 s  - for unique
0.0142 s  - for find(sparse)
0.0302 s  = for setxor

I always like sparse for a problem like this - you get the count of elements at the same time as their unique values.

EDIT per @chappj's suggestion. I added a fourth option

b4 = find(accumarray(a(:,1)+1,1)-1);
b4(b4==0) = [];

Time:

0.0029 s , THREE TIMES FASTER THAN UNIQUE

Ladies and gentlemen, we have a winner.

AFTERWORD the index-based methods (sparse and accumarray) only work with integer-valued inputs (although they can be of double type). This seemed OK based on the input array given in the question, but doesn't work for non-integer valued inputs. Of course, unique is a tricky concept when you have doubles - number that "look" the same may be represented differently. You might consider truncating the input array (sanitizing the data) to make sure this is not a problem. For example, if you did

a = 0.001 * double(int(a * 1000));

You would round all values to no more than 3 significant figures, and because you went "via an int" you are sure that you don't end up with values that are "very subtly different" (say in the 8th digit or beyond). Of course in that case you could also do

a = round(a * 1000);
mina = min(a(:));
b = find(accumarray(a - mina + 1, 1)) + mina - 1;
b = 0.001 * b(b ~= 0);

This is "fairly robust" for non-integer values (in the above case it handles values with up to three significant digits; if you need more, the space requirements will eventually get too large and this method will be slower than unique, which in fact has to sort the data.)

edited Dec 18, 2013 at 22:31

answered Dec 18, 2013 at 20:51

Floris

46.6k7 gold badges73 silver badges128 bronze badges

6 Comments

chappjc Over a year ago

I prefer accumarray to sparse. Wanna add the timings. ;)

Floris Over a year ago

@chappjc that was a terrific suggestion. It is 3x faster (on my machine)!

chappjc Over a year ago

<bowing>Thank you, thank you.</bowing> But it's likely only a matter of time until a faster one comes along. ;)

m_power Over a year ago

Your third solution is working correctly, however I can't use it for what I'm doing. I need to have same exact values, not rounded values.

chappjc Over a year ago

@m_power Could you post some actual data on dropbox or somewhere similar. I'm skeptical that you will get a satisfactory solution with anything that does not use tolerances, including unique. Testing floating point values for equality rarely works as desired.

|

Pursuit · Accepted Answer · 2013-12-19 07:59:29Z

3

Why not remove the zeros as a second step:

result2 = unique(.....);
result2 = (result2~=0);

edited Dec 19, 2013 at 7:59

answered Dec 18, 2013 at 20:35

Pursuit

12.4k1 gold badge28 silver badges42 bronze badges

1 Comment

Dan Over a year ago

result2 = result2(result2 ~= 0) might be faster: stackoverflow.com/questions/12421345/…

m_power · Accepted Answer · 2013-12-18 20:43:04Z

0

I also found another way to do it :

result2 = unique(dataArray(dataArray(:,1)>0,1));

answered Dec 18, 2013 at 20:43

m_power

3,2345 gold badges36 silver badges55 bronze badges

2 Comments

chappjc Over a year ago

Just a tip: 1:end is the same as just :. But, yeah, this is a workable solution.

Floris Over a year ago

Strictly speaking you should use dataArray(:,1)~=0 to allow for the possibility of negative numbers. I suspect thought that it's faster to remove the extra zero at the end (smaller array to manipulate).

Collectives™ on Stack Overflow

Fastest way to find unique values in an array

4 Answers 4

8 Comments

6 Comments

1 Comment

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

8 Comments

6 Comments

1 Comment

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related