Create sequence of values based on multiple column values in R

Question

I have a data.frame that is a result of a near neighbor search of points, and it has three columns: V1 represents the index of closest point, V2 the second closest point, and V3 the third:

search_result <- structure(list(V1 = c(1350L, 1390L, 1411L, 1437L, 1444L, 1895L, 
                                       1895L, 1467L, 1478L, 1500L), 
                                V2 = c(1351L, 1391L, 1410L, 1438L, 
                                       1907L, 1456L, 1456L, 1466L, 1477L, 1499L), 
                                V3 = c(1349L, 1389L, 1940L, 1913L, 1445L, 1894L, 
                                       1894L, 1884L, 1479L, 1501L)), 
                           row.names = c(NA, -10L), 
                           class = "data.frame")

As I want the closest neighbor point, I would select V1 as the result and I would be fine. It happens that I also want the index to be ordered, and V1 has some index that are out of order. So I want to create a column that will give me the value of V1 (when it's in order) or the value of V2 or V3 (and V2 has the priority) so the order is preserved. In this case the result would be like:

     V1   V2   V3 ordered
1  1350 1351 1349    1350
2  1390 1391 1389    1390
3  1411 1410 1940    1411
4  1437 1438 1913    1437
5  1444 1907 1445    1444
6  1895 1456 1894    1456 #take V2 instead
7  1895 1456 1894    1456 #take V2 instead
8  1467 1466 1884    1467
9  1478 1477 1479    1478
10 1500 1499 1501    1500

I tried to take the minimum value of each column, but there are cases later on the dataset which the max value would be the desired (not the best option, but closer to the expected). In the example below, there is discontinuity on rows 2, 4, 5 and 6, so I would take the value of V2 (priority) or V3 as the desired, so the "order" is maintained:

# it's harder to see the "order" here, but it starts in V1 = 1881

   V1   V2   V3  ordered
1 1881 1470 1880    1881
2 1457 1893 1894    1893 #take V2 instead
3 1907 1444 1906    1907
4 1442 1443 1908    1908 #take V3 instead
5 1433 1918 1432    1918 #take V2 instead
6 1402 1949 1401    1949 #take V2 instead
7 1968 1969 1967    1968
8 1985 1986 1984    1985
9 1992 1993 1991    1992

The full dataset has 2500 points, and the "unordered" values happen in roughly 10% of it, so I can estimate what's the "order".

Any base tidyverse or data.table help would be appreciated. Thanks!

Would it suffice to take the minimum of the columns? That seems to fit for your example data but I don't know if that will work more generally. If not, is there a consistent mathematical way to identify which rows are "out of order"? — Jon Spring
– Jon Spring, Commented Jul 18, 2019 at 22:54
I added some cases where the minimum would not suffice. I can identify (using diff) where probably there is a discontinuity, but it would also identify some "ok" values as suspicious, specially when there is a transition from a bad to an ok value. — Kauê Braga
– Kauê Braga, Commented Jul 18, 2019 at 23:13
It seems like there could be multiple possible solutions in some cases, e.g. couldn't V2 values be used in row 8+9 and still be "in order"? What if you could fix a V1 out-of-order with V2 in one row and V3 in another, or alternately V3 and V2 -- which match is better? — Jon Spring
– Jon Spring, Commented Jul 18, 2019 at 23:17
What happens if at some point none of V1:V3 are greater or equal to the ordered value from the previous row? — Alexis
– Alexis, Commented Jul 19, 2019 at 7:47
In the first example, row 6, why would you take V2 if 1895 > 1444 (i.e., order is still increasing)? — Alexis
– Alexis, Commented Jul 19, 2019 at 15:17

Marcus · Accepted Answer · 2019-07-19 17:11:41Z

2

It sounds like what you want to do is iterate over each column returned by the search and first each row, keeping the first value that satisfies the indices being in order.

Start with assuming the first column is in order. Move to the second column and replace any rows where this is not true. Move to the third column, comparing to your updated ordered column. Continue for all the columns.

There may be a more optimized way of coding this (such as checking if the answer converges prior to iterating all of the columns) but here is a compact way to achieve this (note the lag function is dplyr::lag not stats::lag) :

library(dplyr)
library(purrr)

# using the second data set
# assuming at least one column will satisfy the constraints
data.frame(
  V1 = c(1881, 1457, 1907, 1442, 1433, 1402, 1968, 1985, 1992),
  V2 = c(1470, 1893, 1444, 1443, 1918, 1949, 1969, 1986, 1993),
  V3 = c(1880, 1894, 1906, 1908, 1432, 1401, 1967, 1984, 1991)
) %>%
  dplyr::mutate(
    ordered = reduce(., ~ifelse(.x >= lag(.x, default = 0), .x, .y))
  )

#>     V1   V2   V3 ordered
#> 1 1881 1470 1880    1881
#> 2 1457 1893 1894    1893
#> 3 1907 1444 1906    1907
#> 4 1442 1443 1908    1908
#> 5 1433 1918 1432    1918
#> 6 1402 1949 1401    1949
#> 7 1968 1969 1967    1968
#> 8 1985 1986 1984    1985
#> 9 1992 1993 1991    1992

If you're not sure if you've returned enough columns from the nearest neighbor search, you'll have to add one more iteration to check if the ordered column is ascending

search_results <- data.frame(
  V1 = c(1881, 1457, 1907, 1442, 1433, 1402, 1968, 1785, 1992),
  V2 = c(1470, 1893, 1444, 1443, 1918, 1949, 1969, 1786, 1993),
  V3 = c(1880, 1894, 1906, 1908, 1432, 1401, 1967, 1784, 1991)
) %>%
  dplyr::mutate(
    ordered = reduce(., ~ifelse(.x >= lag(.x, default = 0), .x, .y))
  )

with(search_results, any(ordered < lag(ordered, default = 0)))
#> [1] TRUE

^{Created on 2019-07-19 by the reprex package (v0.3.0)}

answered Jul 19, 2019 at 17:11

Marcus

3,6561 gold badge9 silver badges16 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Kauê Braga Over a year ago

Thanks for the reduce! It doesn't work in the first example though.

Kauê Braga Over a year ago

First, in rows 5-6 there is clearly a "discontinuity", but it's still ordered. So I thought I could set a threshold (lets say 100) to determine if there was a discontinuity or not: ordered = reduce(., ~ifelse(abs(.x - lag(.x, default = 0)) <= 100, .x, .y). It works in this situation, but it also label the transition from rows 7-8 as a discontinuity (and doesn't label row 7 as unordered, as it is the case)

Marcus Over a year ago

Can you clarify your logic on selecting the value for the ordered column? From your comment above is it 1) the values have to be in ascending order 2) and the first value that is not a discontinuity (as determined by a threshold)? The second point meaning that the selected value doesn't have to be the absolute closest to the previous value, just within a threshold. I.e. this is why for the second data set, row three 1907 is selected instead of 1906 (even though the latter is closer to 1893)

Ronak Shah · Accepted Answer · 2019-07-24 01:17:52Z

Since V1 should always be increasing, we can take first value of V1 as reference and subtract all the values from 2nd row by this first_value and take the one which gives the minimum difference. Since, we also want to consider priority one way is to multiply the difference by incremental number. In this example, I have just multiplied it by integers 1, 2 and 3. So the first difference is multiplied by 1, second by 2 and so on. More complex methods can be thought of to assign priority if some edge case are found.

first_value <- search_result$V1[1]
search_result$ordered <- c(first_value, apply(search_result[-1, ], 1, function(x) {
     x <- x[x > first_value]
     x[which.min((x - first_value) * seq_along(x))]
}))

search_result
#     V1   V2   V3 ordered
#1  1350 1351 1349    1350
#2  1390 1391 1389    1390
#3  1411 1410 1940    1411
#4  1437 1438 1913    1437
#5  1444 1907 1445    1444
#6  1895 1456 1894    1456
#7  1895 1456 1894    1456
#8  1467 1466 1884    1467
#9  1478 1477 1479    1478
#10 1500 1499 1501    1500

This also works for the second dataset, consider it as df

first_value <- df$V1[1]
df$ordered <- c(first_value, apply(df[-1, ], 1, function(x) {
     x <- x[x > first_value]
     x[which.min((x - first_value) * seq_along(x))]
}))

df
#    V1   V2   V3 ordered
#1 1881 1470 1880    1881
#2 1457 1893 1894    1893
#3 1907 1444 1906    1907
#4 1442 1443 1908    1908
#5 1433 1918 1432    1918
#6 1402 1949 1401    1949
#7 1968 1969 1967    1968
#8 1985 1986 1984    1985
#9 1992 1993 1991    1992

Collectives™ on Stack Overflow

Create sequence of values based on multiple column values in R

2 Answers 2

3 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related