
I'm currently trying to filter a dataset containing audio data on bird species. The data looks like this:

head(audiomoth_sample)

id park park_abbr am_no sci_name com_name start_s end_s conf date_time
102 Appelscha AA A1 Turdus_merula Eurasian Blackbird 26 29 0.5305 2025-04-29 16:55:00
103 Appelscha AA A1 Turdus_merula Eurasian Blackbird 27 30 0.3356 2025-04-29 16:55:00
104 Appelscha AA A1 Turdus_merula Eurasian Blackbird 28 31 0.1958 2025-04-29 16:55:00
105 Appelscha AA A1 Oriolus oriolus Eurasian Golden Oriole 31 34 0.1293 2025-04-29 16:55:00
106 Appelscha AA A1 Turdus_merula Eurasian Blackbird 31 34 0.1056 2025-04-29 16:55:00
107 Appelscha AA A1 Turdus_merula Eurasian Blackbird 32 35 0.3121 2025-04-29 16:55:00

In total I'm dealing with 31 parks, 10 recorders each (the am_no column, "A1" to "A5" and "B1" to "B5"), and 124 bird species. You won't see any am_no other than A1 here because I've been working with a subset of the first 100 detections per park (so 3,100 rows in total). The full data set has over 19 million observations and was analysed in BirdNET in 3-second snippets with 2-second overlap, so "start_s" is the starting second and "end_s" the ending second of the respective snippet.

What I would like to do is combine these snippets into singing events: all rows where a bird keeps singing uninterrupted at the same park and recorder (an interruption being a gap of > 20 seconds, though this might change at a later date) should be combined. In such a case I'd want to retain the data of the first row, but take the end_s of the last row and the average "conf" (confidence level) of all rows in between, including the first and last. For example, the Blackbird rows above (ids 102, 103, 104, 106 and 107) should become a single singing event with start_s = 26, end_s = 35 and conf = mean(0.5305, 0.3356, 0.1958, 0.1056, 0.3121) ≈ 0.296.

So far, I've split the audiomoth_sample dataframe into a list of 124 per-species dataframes so that two birds of different species singing at the same time don't split each other's singing events. For example, in the table above, the Golden Oriole at id 105 would otherwise give the Blackbird two singing events, one from ids 102-104 and one from 106 onwards, even though the gap is less than 20 seconds.
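
For reference, that per-species split can be done in one line with base split() (or dplyr::group_split()); a minimal sketch using the audiomoth_sample data frame from above:

# one data frame per species, named by sci_name
species_list <- split(audiomoth_sample, audiomoth_sample$sci_name)
length(species_list)  # should be 124, one element per species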

I've managed to write a function that gets me most of the way to where I need to be, but I run into something weird in the output. Below is an example for the Eurasian Blackbird subset. I've left out the columns park and com_name, since they are functionally identical to park_abbr and sci_name, and date_time, since I haven't used it in the function thus far; it's only the start datetime of the original AudioMoth recording.

# loading library
library(tidyverse)



# recreating blackbird data
blackbird <- structure(list(id = c(5801137L, 5801138L, 5801139L, 5801146L, 
5801147L, 5801148L, 5801149L, 5801150L, 5801154L, 5801155L, 5801156L, 
5801157L, 5801158L, 5801168L, 5801178L, 5801188L, 5801189L, 5801190L, 
5801191L, 5801192L, 5801193L, 5801194L, 5801195L, 5801196L, 5801197L, 
5801198L, 5801200L, 5801202L, 5801203L, 5801204L, 5801206L, 5801208L, 
5801211L, 5801215L, 5801217L, 5801219L, 5801220L, 5801221L, 5801222L, 
5801223L, 5801224L, 5801230L, 5801231L, 5801232L, 5801235L, 10399185L, 
10399202L, 13435015L, 13435017L, 13435018L, 13435019L, 13435020L, 
13435021L, 13435022L, 13435023L, 13435024L, 13435025L, 13435026L, 
13435027L, 13435028L, 13435030L, 13435032L, 13435033L, 14544238L, 
14544245L), park_abbr = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L), levels = c("GV", "NH", 
"RO", "TE"), class = "factor"), am_no = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), levels = "A1", class = "factor"), 
    sci_name = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), levels = "Aegithalos caudatus", class = "factor"), 
    start_s = c(0L, 1L, 2L, 6L, 7L, 8L, 9L, 10L, 14L, 15L, 16L, 
    17L, 18L, 25L, 32L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 
    50L, 51L, 52L, 53L, 54L, 55L, 56L, 57L, 58L, 59L, 61L, 62L, 
    63L, 64L, 65L, 66L, 67L, 68L, 73L, 74L, 75L, 76L, 103L, 122L, 
    732L, 733L, 734L, 742L, 743L, 744L, 745L, 746L, 747L, 748L, 
    749L, 750L, 751L, 752L, 753L, 754L, 339L, 495L), end_s = c(3L, 
    4L, 5L, 9L, 10L, 11L, 12L, 13L, 17L, 18L, 19L, 20L, 21L, 
    28L, 35L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L, 
    55L, 56L, 57L, 58L, 59L, 60L, 61L, 62L, 64L, 65L, 66L, 67L, 
    68L, 69L, 70L, 71L, 76L, 77L, 78L, 79L, 106L, 125L, 735L, 
    736L, 737L, 745L, 746L, 747L, 748L, 749L, 750L, 751L, 752L, 
    753L, 754L, 755L, 756L, 757L, 342L, 498L), conf = c(0.2234, 
    0.5614, 0.4781, 0.1463, 0.3758, 0.5148, 0.7357, 0.5954, 0.3679, 
    0.8198, 0.5869, 0.82, 0.3683, 0.3678, 0.1484, 0.9014, 0.9899, 
    0.9958, 0.9964, 0.9957, 0.9467, 0.9939, 0.9917, 0.9915, 0.9533, 
    0.7444, 0.8523, 0.9729, 0.9253, 0.6424, 0.8775, 0.5058, 0.3696, 
    0.8907, 0.8253, 0.9798, 0.8965, 0.9547, 0.7425, 0.9277, 0.7885, 
    0.9165, 0.5399, 0.5519, 0.1256, 0.1927, 0.2104, 0.546, 0.1157, 
    0.1264, 0.4881, 0.8465, 0.5787, 0.9218, 0.7392, 0.9863, 0.9783, 
    0.9544, 0.2077, 0.1305, 0.6694, 0.1298, 0.8092, 0.1887, 0.3975
    )), row.names = c(NA, -65L), class = "data.frame")
# setting and running the function 
clean.dataset <- function(df) {
  cleaned_df <- data.frame()
  last_park <- NULL
  last_am_no <- NULL
  last_start <- NULL
  last_end <- NULL
  last_conf <- NULL
  keeping_count <- NULL
  for (i in 1:nrow(df)) {
    park <- df$park_abbr[i]
    am_no <- df$am_no[i]
    start <- df$start_s[i]
    end <- df$end_s[i]
    conf <- df$conf[i]
    if (
      is.null(last_start)
    ) {
      cleaned_df <- rbind(cleaned_df, df[i, ])
      last_park <- park
      last_am_no <- am_no
      last_start <- start
      last_end <- end
      last_conf <- conf
      keeping_count <- 1
    } else {
      if (
        start - last_end < 20 && last_park == park && last_am_no == am_no
      ) {
        last_end <- end
        last_conf <- last_conf + conf
        keeping_count <- keeping_count + 1
      } else {
        if (park != last_park || last_am_no != am_no) {
          cleaned_df <- rbind(cleaned_df, df[i, ])
          last_conf <- last_conf / keeping_count
          cleaned_df[i - keeping_count, 6] <- last_end # 6 was 8 to match full data
          cleaned_df[i - keeping_count, 7] <- last_conf # 7 was 9 to match full data
          last_park <- park
          last_am_no <- am_no
          last_start <- start
          last_end <- end
          last_conf <- conf
          keeping_count <- 1
        } else {
          cleaned_df <- rbind(cleaned_df, df[i, ])
          last_conf <- last_conf / keeping_count
          cleaned_df[i - keeping_count, 6] <- last_end # 6 was 8 to match full data
          cleaned_df[i - keeping_count, 7] <- last_conf # 7 was 9 to match full data
          last_park <- park
          last_am_no <- am_no
          last_start <- start
          last_end <- end
          last_conf <- conf
          keeping_count <- 1
        }
      }
    }
  }
  return(cleaned_df)
}


clean.dataset(blackbird)

From this subset, I expected a data.frame that looks like this:

id park_abbr am_no sci_name start_s end_s conf
5801137 GV A1 Aegithalos caudatus 0 79 0.7088022
10399185 NH A1 Aegithalos caudatus 103 125 0.2104
13435015 RO A1 Aegithalos caudatus 122 757 0.57675
14544238 TE A1 Aegithalos caudatus 339 342 0.1887
14544245 TE A1 Aegithalos caudatus 495 498 0.3975

Instead, I get this. Apologies for the pictures, but I didn't know how to capture this otherwise. As you can see, the first row is correct, but the next two have incorrect end times and confidence levels; they are just the values from the raw data. They are followed by a lot of rows filled with nothing but NAs, and further on rows 28 to 37 are also all NA. The second section of the table is mostly NAs too, but it also contains rows named 46.1, 48.1, and 64.1, which hold only the correct end time and confidence values that should have been written to rows 46, 48 and 64 respectively.

I've checked some of the other large sub-dataframes and they have the same issues. The all-NA rows aren't really a big deal, as I could just filter those out with blackbird <- blackbird[complete.cases(blackbird), ]. However, the "x.1" rows with the correct data in them are something I don't understand and haven't been able to find anything about either.

I suspect my issue lies somewhere in the keeping_count variable, which I use to divide the running total of last_conf by. I think something goes wrong when I use cleaned_df[i - keeping_count, 6] and cleaned_df[i - keeping_count, 7] to assign the new values to an already written row. I've tried adding a variable write <- 1 to the function, incrementing it (write <- write + 1) whenever keeping_count is reset to 1 or incremented. Using that instead of keeping_count whenever I need to index a row in the dataframe (e.g. cleaned_df[i - write, 6]) seems to get rid of the NA rows and the "x.1" rows, at least for the blackbird subset, but results in incorrect data being written to the end_s and conf columns instead.

Does anyone know why the "x.1" rows are showing up instead of the data getting written to where I want it? How would I get the data in the right place?

  • You might want to share a broader sample of your data, e.g. 75 rows, so we can include more groups than we do now when re-sampling back up to 19 million rows. Commented Aug 26 at 22:25
  • As to the decimal row names, you presumably have a bug somewhere where you're adding a duplicate row, such as how row 46 becomes your 2nd row and then it gets added again as your 46th row. See stackoverflow.com/questions/60761717/… Commented Aug 27 at 16:52
  • Stepping through your loop, cleaned_df is populated with new rows when i is 1 and 46, corresponding to new values of park_abbr. When i is 48, you bind in row 48, so cleaned_df has three rows with row names 1, 46, and 48. But then your line cleaned_df[i - keeping_count, 6] <- last_end tells R to put the current value of last_end into the 46th (i of 48 minus keeping_count of 2) row of a cleaned_df that only has 3 rows. The new rows get sequential row names, so row 46 would repeat your 2nd row name and is therefore assigned the name 46.1 to avoid duplication; see the minimal sketch after these comments. Commented Aug 27 at 17:24
  • (All to say I encourage you to review some materials about R programming, like r4ds.hadley.nz. I'm guessing you might have prior experience with languages like VBA, which sometimes force you to do this sort of onerous bookkeeping. R has powerful features like vectorization which make it performant even though it's a high-level, interpreted language. Code like what you started with misses out both on the abstraction that a high-level language allows and on the performance that a low-level language like C might provide if you coded it this way.) Commented Aug 27 at 17:32
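
To see that row-name mechanism in isolation, here is a minimal sketch using the blackbird data defined above: a three-row data frame with row names "1", "46" and "48" (as cleaned_df has after its third rbind()), written to at positional index 46.

toy <- blackbird[c(1, 46, 48), c("id", "end_s", "conf")]  # row names "1", "46", "48"
toy["46", ]               # the row the loop intends to update (matched by name)
toy[46, "end_s"] <- 125   # but this writes to *positional* row 46 of a 3-row frame
tail(toy, 3)              # rows 4-45 are padded with NA; the value lands in a new row named "46.1"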

2 Answers


Since you loaded tidyverse, I might suggest a simpler dplyr approach that relies on grouped data rather than hand-rolled bookkeeping in a row-by-row loop. I wouldn't be surprised if calculation speed increased by 1000x or more, since a row-by-row loop leaves a lot of performance on the table (see https://www.noamross.net/archives/2014-04-16-vectorization-in-r-why/). More importantly, I think this approach is much easier to read, understand, debug, and update.

This should also eliminate the need to split your data by bird, since that can be considered a group as well.

Since 2023's dplyr 1.1.0, we've had .by, which allows us to specify groupings within the mutate/summarize/filter/etc. step, an improvement on the prior group_by(...) |> [do thing] |> ungroup() syntax.
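
As a tiny illustration (a made-up toy tibble, not the bird data), these two spellings do the same grouped computation:

df <- tibble(g = c("a", "a", "b"), x = 1:3)

# pre-1.1.0 style: persistent grouping that has to be removed again
df |> group_by(g) |> mutate(x2 = cumsum(x)) |> ungroup()

# 1.1.0+ style: per-operation grouping; the result comes back ungrouped
df |> mutate(x2 = cumsum(x), .by = g)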

Here, I arrange by time (plus the other defining groupings, for readability -- functionally it's only important to sort by time so that the lag step works correctly within each group). Then we create a flag variable gap that is TRUE whenever a start time comes more than 20 seconds after the preceding end time within that park-sensor-bird group, and take its cumulative sum to make a group variable that is constant across contiguous detections. Finally we summarize within those groupings to get the output you sought:

blackbird |>
  arrange(park_abbr, am_no, sci_name, start_s) |>
  # gap if start_s >20s after prior end_s in grp, default -100 [for 1st rows]
  mutate(gap = start_s > lag(end_s,1,-100) + 20,
         group = cumsum(gap),
         .by = c(park_abbr, am_no, sci_name)) |>
  summarize(id = min(id),
            start_s = min(start_s),
            end_s = max(end_s),
            conf = mean(conf),
            .by = c(park_abbr, am_no, sci_name, group))
  park_abbr am_no            sci_name group       id start_s end_s      conf
1        GV    A1 Aegithalos caudatus     1  5801137       0    79 0.7088022
2        NH    A1 Aegithalos caudatus     1 10399185     103   125 0.2015500
3        RO    A1 Aegithalos caudatus     1 13435015     732   757 0.5767500
4        TE    A1 Aegithalos caudatus     1 14544238     339   342 0.1887000
5        TE    A1 Aegithalos caudatus     2 14544245     495   498 0.3975000

dplyr's performance may get bogged down (though perhaps still acceptably fast?) with 19 million rows, especially if they are subdivided into many groups. My personal rule of thumb is to look at alternatives if the number of groups exceeds 10,000 -- at that point, high-performance data manipulation packages like data.table, duckdb, collapse, arrow, or polars can be useful for improving performance. Here's a benchmark showing how, for 10M-row data, dplyr takes a hit when there are many (e.g. 100k) groups: https://h2oai.github.io/db-benchmark/#gb

Fortunately for us, people have made wrappers which allow us to get most of the performance benefits of those packages while using dplyr syntax, which I'm most comfortable with.

So, for example, you could start with blackbird |> dtplyr::lazy_dt() |> to automatically translate the dplyr code to data.table. (This won't work for every function, but for these vanilla ones it does fine.)
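
A rough sketch of what that route could look like end to end, reusing the pipeline above (written with group_by()/ungroup() rather than .by, which dtplyr translates reliably; as_tibble() at the end forces the lazy data.table computation):

library(dtplyr)

blackbird |>
  lazy_dt() |>                                  # build a lazy data.table translation
  arrange(park_abbr, am_no, sci_name, start_s) |>
  group_by(park_abbr, am_no, sci_name) |>
  mutate(gap = start_s > lag(end_s, 1, default = -100) + 20,
         group = cumsum(gap)) |>
  group_by(park_abbr, am_no, sci_name, group) |>
  summarize(id = min(id),
            start_s = min(start_s),
            end_s = max(end_s),
            conf = mean(conf)) |>
  as_tibble()                                   # collect back into an ordinary tibble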

Or you could use the duckdb wrapper duckplyr. As of v1.1.1, duckplyr didn't (to my understanding) know how to use factor variables for groupings, so I convert the sci_name to simple text and then it works for me: blackbird |> mutate(sci_name = as.character(sci_name)) |> duckplyr::as_duck_tbl() |> ...

I expect either of those last two approaches should work pretty well with your full data.


4 Comments

Deleted my comments and translated your answer to plain {collapse} with dplyr syntax. Of course, you are invited to copy over, then I will delete my answer since the ideas are yours!
@Friede both your suggestions worked perfectly on the main dataset and I didn't really notice a speed difference. In case you're curious, there were 196 groups in the full dataset. I went from 19254721 observations down to 2051033. I am still curious (for future reference if nothing else) whether one of you knows why my original for-loop was creating those x.1 rows?
For 19e6 observations and 196 groups, {collapse} should be very fast. For this kind of operation it doesn't really matter whether there are 5 or 200 groups. Sorry, I do not want to debug your code; it is a lot.

I suspect {collapse} will be fast. Translating Jon Spring's answer, measuring the run time, and applying it to

# this is just re-sampling the rows of the toy data
# it does not increase the number of groups
blackbird2 = blackbird[sample(nrow(blackbird), 19e6, TRUE), ]

gives

> library(collapse)
> system.time({
>     r = 
>       blackbird2 |>
>       roworder(park_abbr, am_no, sci_name, start_s) |>
>       fgroup_by(park_abbr, am_no, sci_name) |>
>       fmutate(gap = start_s > L(end_s, fill=-100, 
>        g=c(park_abbr, am_no, sci_name, group))+20, group = cumsum(gap)) |>
>       fgroup_by(park_abbr, am_no, sci_name, group) |>
>       fsummarise(id = min(id), start_s = min(start_s), 
>                  end_s = max(end_s), conf = mean(conf)) 
> })
  user  system elapsed 
 1.370   0.124   1.497 

where

> r
 park_abbr am_no            sci_name group       id start_s end_s      conf
1        GV    A1 Aegithalos caudatus     1  5801137       0    79 0.7088477
2        NH    A1 Aegithalos caudatus     1 10399185     103   125 0.2015659
3        RO    A1 Aegithalos caudatus     1 13435015     732   757 0.5768488
4        TE    A1 Aegithalos caudatus     1 14544238     339   342 0.1887000
5        TE    A1 Aegithalos caudatus     2 14544245     495   498 0.3975000

Note

> sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] collapse_2.1.2

loaded via a namespace (and not attached):
[1] compiler_4.5.0 parallel_4.5.0 tools_4.5.0    Rcpp_1.0.14   

2 Comments

Thanks for adding this. Your scaled-up example likely understates the performance improvement over dplyr since the example data only has 5 groups, whereas the real data could have millions.
@JonSpring, thank you for the comment. I thought about re-sampling blackbird more realistically, but I am not too sure about the real structure.
