I'm currently trying to filter a dataset containing audio data on bird species. The data looks like this:
head(audiomoth_sample)
| id | park | park_abbr | am_no | sci_name | com_name | start_s | end_s | conf | date_time |
|---|---|---|---|---|---|---|---|---|---|
| 102 | Appelscha | AA | A1 | Turdus_merula | Eurasian Blackbird | 26 | 29 | 0.5305 | 2025-04-29 16:55:00 |
| 103 | Appelscha | AA | A1 | Turdus_merula | Eurasian Blackbird | 27 | 30 | 0.3356 | 2025-04-29 16:55:00 |
| 104 | Appelscha | AA | A1 | Turdus_merula | Eurasian Blackbird | 28 | 31 | 0.1958 | 2025-04-29 16:55:00 |
| 105 | Appelscha | AA | A1 | Oriolus oriolus | Eurasian Golden Oriole | 31 | 34 | 0.1293 | 2025-04-29 16:55:00 |
| 106 | Appelscha | AA | A1 | Turdus_merula | Eurasian Blackbird | 31 | 34 | 0.1056 | 2025-04-29 16:55:00 |
| 107 | Appelscha | AA | A1 | Turdus_merula | Eurasian Blackbird | 32 | 35 | 0.3121 | 2025-04-29 16:55:00 |
In total I'm dealing with 31 parks, 10 recorders each (am_no column, "A1" to "A5" and "B1" to "B5"), and 124 bird species. You wont be seeing any other am_no apart from A1 as I've been working with a subset of the first 100 detections per park (so 3100 in total). The data in total is over 19 million observations, and was analysed in BirdNET in 3 second snippets, with 2 second overlap; so "start_s" is the starting second and "end_s" is the ending second of the respective snippet.
What I would like to do is combine these snippets into singing events, where all rows where a bird keeps singing, at the same park and recorder, uniterrupted (interruption is a gap of > 20 seconds, though this might change at a later date) are combined. In such a case, I'd want to retain the data of the first row, but add the end_s of the last and take the average "conf" (confidence level) of all rows inbetween including the first and last.
So far, I've split the audiomoth_sample dataframe into a list of 124 per-species dataframes to avoid two birds of different species singing at the same time resulting in a split in singing events. For example, in the table above, the Golden Oriole at id 105 would result in the Blackbird having two singing events, from id 102-104 and 106 onwards, despite the gap being less than 20 seconds.
I've managed to write a function that gets me most of the way to where I need to be, but I run into something weird in the output. Below is an example for the Eurasian Blackbird subset. I've left out the columns park and com_name since they are functionally identical to park_abbr and sci_name , and date_time since I haven't used it in the function thusfar; it's only the start-datetime of the original AudioMoth recording.
# loading library
library(tidyverse)
# recreating blackbird data
blackbird <- structure(list(id = c(5801137L, 5801138L, 5801139L, 5801146L,
5801147L, 5801148L, 5801149L, 5801150L, 5801154L, 5801155L, 5801156L,
5801157L, 5801158L, 5801168L, 5801178L, 5801188L, 5801189L, 5801190L,
5801191L, 5801192L, 5801193L, 5801194L, 5801195L, 5801196L, 5801197L,
5801198L, 5801200L, 5801202L, 5801203L, 5801204L, 5801206L, 5801208L,
5801211L, 5801215L, 5801217L, 5801219L, 5801220L, 5801221L, 5801222L,
5801223L, 5801224L, 5801230L, 5801231L, 5801232L, 5801235L, 10399185L,
10399202L, 13435015L, 13435017L, 13435018L, 13435019L, 13435020L,
13435021L, 13435022L, 13435023L, 13435024L, 13435025L, 13435026L,
13435027L, 13435028L, 13435030L, 13435032L, 13435033L, 14544238L,
14544245L), park_abbr = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L), levels = c("GV", "NH",
"RO", "TE"), class = "factor"), am_no = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), levels = "A1", class = "factor"),
sci_name = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), levels = "Aegithalos caudatus", class = "factor"),
start_s = c(0L, 1L, 2L, 6L, 7L, 8L, 9L, 10L, 14L, 15L, 16L,
17L, 18L, 25L, 32L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L,
50L, 51L, 52L, 53L, 54L, 55L, 56L, 57L, 58L, 59L, 61L, 62L,
63L, 64L, 65L, 66L, 67L, 68L, 73L, 74L, 75L, 76L, 103L, 122L,
732L, 733L, 734L, 742L, 743L, 744L, 745L, 746L, 747L, 748L,
749L, 750L, 751L, 752L, 753L, 754L, 339L, 495L), end_s = c(3L,
4L, 5L, 9L, 10L, 11L, 12L, 13L, 17L, 18L, 19L, 20L, 21L,
28L, 35L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L,
55L, 56L, 57L, 58L, 59L, 60L, 61L, 62L, 64L, 65L, 66L, 67L,
68L, 69L, 70L, 71L, 76L, 77L, 78L, 79L, 106L, 125L, 735L,
736L, 737L, 745L, 746L, 747L, 748L, 749L, 750L, 751L, 752L,
753L, 754L, 755L, 756L, 757L, 342L, 498L), conf = c(0.2234,
0.5614, 0.4781, 0.1463, 0.3758, 0.5148, 0.7357, 0.5954, 0.3679,
0.8198, 0.5869, 0.82, 0.3683, 0.3678, 0.1484, 0.9014, 0.9899,
0.9958, 0.9964, 0.9957, 0.9467, 0.9939, 0.9917, 0.9915, 0.9533,
0.7444, 0.8523, 0.9729, 0.9253, 0.6424, 0.8775, 0.5058, 0.3696,
0.8907, 0.8253, 0.9798, 0.8965, 0.9547, 0.7425, 0.9277, 0.7885,
0.9165, 0.5399, 0.5519, 0.1256, 0.1927, 0.2104, 0.546, 0.1157,
0.1264, 0.4881, 0.8465, 0.5787, 0.9218, 0.7392, 0.9863, 0.9783,
0.9544, 0.2077, 0.1305, 0.6694, 0.1298, 0.8092, 0.1887, 0.3975
)), row.names = c(NA, -65L), class = "data.frame")
# setting and running the function
clean.dataset <- function(df) {
cleaned_df <- data.frame()
last_park <- NULL
last_am_no <- NULL
last_start <- NULL
last_end <- NULL
last_conf <- NULL
keeping_count <- NULL
for (i in 1:nrow(df)) {
park <- df$park_abbr[i]
am_no <- df$am_no[i]
start <- df$start_s[i]
end <- df$end_s[i]
conf <- df$conf[i]
if (
is.null(last_start)
) {
cleaned_df <- rbind(cleaned_df, df[i, ])
last_park <- park
last_am_no <- am_no
last_start <- start
last_end <- end
last_conf <- conf
keeping_count <- 1
} else {
if (
start - last_end < 20 && last_park == park && last_am_no == am_no
) {
last_end <- end
last_conf <- last_conf + conf
keeping_count <- keeping_count + 1
} else {
if (park != last_park || last_am_no != am_no) {
cleaned_df <- rbind(cleaned_df, df[i, ])
last_conf <- last_conf / keeping_count
cleaned_df[i - keeping_count, 6] <- last_end # 6 was 8 to match full data
cleaned_df[i - keeping_count, 7] <- last_conf # 7 was 9 to match full data
last_park <- park
last_am_no <- am_no
last_start <- start
last_end <- end
last_conf <- conf
keeping_count <- 1
} else {
cleaned_df <- rbind(cleaned_df, df[i, ])
last_conf <- last_conf / keeping_count
cleaned_df[i - keeping_count, 6] <- last_end # 6 was 8 to match full data
cleaned_df[i - keeping_count, 7] <- last_conf # 7 was 9 to match full data
last_park <- park
last_am_no <- am_no
last_start <- start
last_end <- end
last_conf <- conf
keeping_count <- 1
}
}
}
}
return(cleaned_df)
}
clean.dataset(blackbird)
From this subset, I expected a data.frame that looks like this:
| id | park_abbr | am_no | sci_name | start_s | end_s | conf |
|---|---|---|---|---|---|---|
| 5801137 | GV | A1 | Aegithalos caudatus | 0 | 79 | 0.7088022 |
| 10399185 | NH | A1 | Aegithalos caudatus | 103 | 125 | 0.2104 |
| 13435015 | RO | A1 | Aegithalos caudatus | 122 | 757 | 0.57675 |
| 14544238 | TE | A1 | Aegithalos caudatus | 339 | 342 | 0.1887 |
| 14544245 | TE | A1 | Aegithalos caudatus | 495 | 498 | 0.3975 |
Instead, I get this. Apologies for the pictures but I didn't know how to capture this otherwise. As you can see, The first row is correct, but the second two have incorrect end times and confidence levels; they are just the ones from the raw data. They are followed by a lot of rows filled with just NA's . This continues until for a while, row 28 to 37 are also filled with NA's. The second section of the table is mostly NA's too, but also contains rows named 46.1, 48.1, and 64.1, which contain only the correct end time and confidence values which should've been in rows 46, 48 and 64 respectively.
I've checked some of the other large sub-dataframes and they have the same issues. All of the NA rows with no data in them aren't really a big deal, as I could just filter those out with blackbird <- blackbird[complete.cases(blackbird), ]. However, the "x.1" rows with the correct data in them is something I don't understand and haven't been able to find anything about either.
I suspect my issue lies somewhere in the keeping_count variable I've used to divide the sum-total of all the last_conf variables by. I think something is going wrong when I use cleaned_df[i - keeping_count, 6] and cleaned_df[i - keeping_count, 7] to assign the new values to an already written row. I've tried adding a variable write <- 1 to the function that gets a write <- write +1 whenever keeping_count gets reset to 1 or gets a +1. Using that instead of keeping_count whenever I need to index a row in the dataframe (e.g. cleaned_df[i - write, 6]) seems to get rid of the NA rows and the "x.1" rows, at least for the blackbird subset, but results in incorrect data being written in the end_s and conf columns instead.
Does anyone know why the "x.1" rows are showing up instead of the data getting written to where I want it? How would I get the data in the right place?
cleaned_dfis populated with new rows wheniis 1, and 46, corresponding to new values ofpark_abbr. Wheniis 48, you bind in row 48, makingcleaned_dfbe three rows with row names 1, 46, and 48. But then your linecleaned_df[i - keeping_count, 6] <- last_endtells R to put the value 735 (last_end) into the 46th (iof 48 minuskeeping_countof 2) row ofcleaned_dfthat only had 3 rows. The new rows have sequential row names, so row 46 is a repeat of your 2nd row name, so it is assigned row46.1to avoid duplication.