I want to find all elements of an array $a1 that are contained in neither array $a2 nor array $a3.
For example:
$a1 = @(1,2,3,4,5,6,7,8)
$a2 = @(1,2,3)
$a3 = @(4,5,6,7)
Expected result:
8
k7s5a's helpful answer is conceptually elegant and convenient, but there's a caveat:
It doesn't scale well, because a linear array lookup must be performed for each $a1 element.
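For context, the lookup-based approach can be sketched as follows, using the sample arrays from the question. This is reconstructed from the -notcontains variants timed in the benchmark further down rather than quoted from k7s5a's answer, so treat its exact form as an assumption:

```powershell
$a1 = @(1,2,3,4,5,6,7,8)
$a2 = @(1,2,3)
$a3 = @(4,5,6,7)

# Assumed form of the lookup-based approach: for every element of $a1,
# -notcontains scans the joined array linearly, so the total work grows
# roughly with (number of $a1 elements) x (number of $a2 + $a3 elements).
$missing = $a1 | Where-Object { ($a2 + $a3) -notcontains $_ }
$missing   # 8
```

With 8 elements the cost is negligible; it only becomes noticeable as the arrays grow.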
At least for larger arrays, PowerShell's Compare-Object cmdlet is the better choice:
If the input arrays are ALREADY SORTED:
(Compare-Object $a1 ($a2 + $a3) | Where-Object SideIndicator -eq '<=').InputObject
Note:
* Compare-Object doesn't require sorted input, but sorted input can greatly enhance performance - see below.
* As Esperento57 points out, (Compare-Object $a1 ($a2 + $a3)).InputObject is sufficient in the specific case at hand, but only because $a2 and $a3 happen not to contain elements that aren't also in $a1.
Therefore, the more general solution is to use the filter Where-Object SideIndicator -eq '<=', because it limits the results to objects that are present only in the LHS ($a1), that is, missing from the RHS ($a2 + $a3), and not also vice versa.
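To illustrate why the filter matters, here is a small sketch with a hypothetical variation of the sample data in which $a3 contains an element (9) that $a1 lacks:

```powershell
$a1 = @(1,2,3,4,5,6,7,8)
$a2 = @(1,2,3)
$a3 = @(4,5,6,7,9)   # hypothetical variation: 9 is not in $a1

# Without the filter, differences from BOTH sides are returned (8 and 9).
$unfiltered = (Compare-Object $a1 ($a2 + $a3)).InputObject

# With the filter, only elements unique to the LHS ($a1) are returned.
$filtered = (Compare-Object $a1 ($a2 + $a3) |
  Where-Object SideIndicator -eq '<=').InputObject
$filtered   # 8
```

Here $unfiltered contains both 8 and 9, whereas $filtered contains only 8.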
If the input arrays are NOT SORTED:
Explicitly sorting the input arrays before comparing them greatly enhances performance:
(Compare-Object ($a1 | Sort-Object) ($a2 + $a3 | Sort-Object) |
Where-Object SideIndicator -eq '<=').InputObject
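If this is needed in more than one place, the sort-then-compare command can be wrapped in a small helper. This is only a minimal sketch: the function name Get-MissingElement and its parameter names are hypothetical, not part of PowerShell, and it assumes both arrays are non-empty.

```powershell
# Hypothetical helper: returns the elements of $Reference that do not
# occur in $Exclude. Both sides are sorted before they are compared.
function Get-MissingElement {
  param(
    [object[]] $Reference,
    [object[]] $Exclude
  )
  (Compare-Object ($Reference | Sort-Object) ($Exclude | Sort-Object) |
    Where-Object SideIndicator -eq '<=').InputObject
}

# The question's sample values, deliberately shuffled.
$a1 = 8,3,1,6,2,7,5,4
$a2 = 3,1,2
$a3 = 7,4,6,5
Get-MissingElement -Reference $a1 -Exclude ($a2 + $a3)   # 8
```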
The following example, which uses a 10,000-element array, illustrates the difference in performance:
$count = 10000                   # Adjust this number to test scaling.
$a1 = 0..$($count-1)             # With 10,000: 0..9999
$a2 = 0..$($count/2)             # With 10,000: 0..5000
$a3 = $($count/2+1)..($count-3)  # With 10,000: 5001..9997
$(foreach ($pass in 1..2) {
  if ($pass -eq 1 ) {
    $passDescr = "SORTED input"
  } else {
    $passDescr = "UNSORTED input"
    # Shuffle the arrays.
    $a1 = $a1 | Get-Random -Count ([int]::MaxValue)
    $a2 = $a2 | Get-Random -Count ([int]::MaxValue)
    $a3 = $a3 | Get-Random -Count ([int]::MaxValue)
  }
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "CompareObject, explicitly sorted first"
    Timing = (Measure-Command {
      (Compare-Object ($a1 | Sort-Object) ($a2 + $a3 | Sort-Object) | Where-Object SideIndicator -eq '<=').InputObject |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "CompareObject"
    Timing = (Measure-Command {
      (Compare-Object $a1 ($a2 + $a3) | Where-Object SideIndicator -eq '<=').InputObject |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "!.Contains(), two-pass"
    Timing = (Measure-Command {
      $a2AndA3 = $a2 + $a3
      $a1 | Where-Object { !$a2AndA3.Contains($_) } |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "!.Contains(), two-pass, explicitly sorted first"
    Timing = (Measure-Command {
      $a2AndA3 = $a2 + $a3 | Sort-Object
      $a1 | Sort-Object | Where-Object { !$a2AndA3.Contains($_) } |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "!.Contains(), single-pass"
    Timing = (Measure-Command {
      $a1 | Where-Object { !($a2 + $a3).Contains($_) } |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "-notcontains, two-pass"
    Timing = (Measure-Command {
      $a2AndA3 = $a2 + $a3
      $a1 | Where-Object { $a2AndA3 -notcontains $_ } |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "-notcontains, two-pass, explicitly sorted first"
    Timing = (Measure-Command {
      $a2AndA3 = $a2 + $a3 | Sort-Object
      $a1 | Sort-Object | Where-Object { $a2AndA3 -notcontains $_ } |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "-notcontains, single-pass"
    Timing = (Measure-Command {
      $a1 | Where-Object { ($a2 + $a3) -notcontains $_ } |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  }
}) |
  Group-Object TestCategory | ForEach-Object {
    "`n=========== $($_.Name)`n"
    $_.Group | Sort-Object Timing | Select-Object Test, @{ l='Timing'; e={ '{0:N3}' -f $_.Timing } }
  }
Sample output from my machine, with timings in seconds (output of missing array elements omitted):
=========== SORTED input
Test                                            Timing
----                                            ------
CompareObject                                   0.068
CompareObject, explicitly sorted first          0.187
!.Contains(), two-pass                          0.548
-notcontains, two-pass                          6.186
-notcontains, two-pass, explicitly sorted first 6.972
!.Contains(), two-pass, explicitly sorted first 12.137
!.Contains(), single-pass                       13.354
-notcontains, single-pass                       18.379
=========== UNSORTED input
CompareObject, explicitly sorted first          0.198
CompareObject                                   6.617
-notcontains, two-pass                          6.927
-notcontains, two-pass, explicitly sorted first 7.142
!.Contains(), two-pass                          12.263
!.Contains(), two-pass, explicitly sorted first 12.641
-notcontains, single-pass                       19.273
!.Contains(), single-pass                       25.174
While timings will vary based on many factors, you can get a sense that Compare-Object scales much better, provided that the input is either presorted or sorted on demand (with unsorted input, 0.198 seconds with explicit sorting versus 6.617 seconds without), and the performance gap widens with increasing element count.
When not using Compare-Object, performance can be improved somewhat, but the inability to take advantage of sorting is the fundamentally limiting factor:
* Neither -notcontains / -contains nor .Contains() can take full advantage of presorted input.
* If the input is already sorted, using the .NET IList.Contains() method rather than the PowerShell -contains / -notcontains operators (which an earlier version of k7s5a's answer used) improves performance.
* Joining arrays $a2 and $a3 once, up front, and then using the joined array in the pipeline improves performance (that way, the arrays don't have to be joined in every iteration).
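The variants discussed above can be seen side by side in the following sketch, which uses the question's sample arrays. Note one difference in behavior that goes beyond performance: .Contains() uses .NET equality, so unlike -contains / -notcontains it performs no type conversion.

```powershell
$a1 = @(1,2,3,4,5,6,7,8)
$a2 = @(1,2,3)
$a3 = @(4,5,6,7)

# Single-pass: ($a2 + $a3) builds a new joined array for EVERY $a1 element.
$singlePass = $a1 | Where-Object { ($a2 + $a3) -notcontains $_ }

# Two-pass: join once, up front, then reuse the joined array in the pipeline.
$a2AndA3 = $a2 + $a3
$twoPassOperator = $a1 | Where-Object { $a2AndA3 -notcontains $_ }

# Same, but with the IList.Contains() .NET method instead of the operator.
$twoPassMethod = $a1 | Where-Object { !$a2AndA3.Contains($_) }

# All three variants yield the same result: 8

# Difference in semantics: the operator converts types, the method doesn't.
$operatorResult = $a2AndA3 -contains '1'   # True: '1' is converted to a number
$methodResult = $a2AndA3.Contains('1')     # False: a string never equals an int
```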