... especially when dealing with data arriving from external sources such as peripherals (e.g., UART, SPI).
This is not directly true. For a UART/SPI you typically read from a data register on the controller. That memory region should NOT be marked as cacheable, so in this case there is nothing to worry about. The second case is that you have a DMA engine that automatically transfers the contents of these data registers to RAM. Again, you can handle this by allocating the DMA buffers so that they are not cacheable.
In a resource-constrained environment, you might wish to perform calculations on the DMA buffer itself. In that case, you can have the DMA controller signal when a buffer transfer is complete and then invalidate the buffer, so that the CPU will re-fetch this memory instead of using the stale cached version.
The term cache clean is also referred to as cache flush. It makes memory available to a DMA peripheral; i.e., it moves data from the cache to external memory. It would be used when you are creating a transmit buffer that is cached by the CPU. I think at one point it would only flush, leaving the value in the cache. There are use cases for both, but generally you might as well invalidate as well, so the cache lines can be reused for other addresses: once you have committed a buffer for transfer, you should be done with it and not change it in flight. The clean/invalidate should be done just before the DMA transmit is activated.
These operations need to be performed because DMA is normally not cache aware.
I have experimented with both cache clean and invalidate functions, and I have not observed any performance degradation or data loss when using only the invalidate function.
It will be difficult to measure a performance increase with 'clean+invalidate'. You would need other accesses that cause cache hits after being allocated into the invalidated entries. If it costs the same number of cycles, then 'clean+invalidate' is often better, as you are telling the CPU/cache controller that you are done with this data set.
However, it is often the case that you operate with a buffer hierarchy, where data is transformed. E.g., the DMA fills an array of ADC channels 0-7, and you then build an SPI-compatible message that stores only some of the ADC channels. In this case, it is better just to leave the DMA buffer uncached, since you are only doing single reads from it.
The most complex use case is frame memory, where you perform blitting directly on a buffer that is DMA'd to the display. Here the access pattern is read/modify/write, and caching can speed things up. This is a far more common use of DMA with caching than with peripherals like SPI, UART, I2C, Ethernet, etc., although many people will prefer to avoid it as well for a variety of reasons; typically, the display is double buffered (update and active screens) to avoid tearing, etc.
Why might we choose to use this function over separate cache clean and invalidate operations?
It is a single instruction on some CPUs; for example, DCCISW and DCCIMVAC on ARM. Combining clean and invalidate is simple for the hardware, as both operations update bits in the same cache structure. The time-consuming part is flushing the data from cache to memory. Note that the instruction may return BEFORE the flush is complete, and other instructions are needed to wait for completion; for instance, a DSB may be required to ensure the data flushing/clean has finished.