No, there isn't an instruction you can run to query the cache status of an address.
Tuning the prefetch distance (prefetching data[i+distance] while processing data[i]) is unfortunately a matter of picking a fixed distance that puts enough other work between the prefetch and the demand load that actually uses the data.
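As a concrete sketch (the distance, the element type, and `process` are placeholders, not tuned or recommended values), a software-prefetch loop with a fixed distance typically looks like this with GCC/Clang's `__builtin_prefetch`:

```c
#include <stddef.h>

/* Hypothetical per-element work, standing in for process(data[i]). */
static float process(float x) { return x * 2.0f + 1.0f; }

float sum_with_prefetch(const float *data, size_t n)
{
    /* Prefetch distance in elements; must be tuned per system.
       16 floats = one 64-byte cache line, so 128 elements is 8 lines
       ahead -- a placeholder, not a recommendation. */
    const size_t PF_DIST = 128;
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            /* arg 2: 0 = prefetch for read; arg 3: 3 = high temporal
               locality (keep in all cache levels). */
            __builtin_prefetch(&data[i + PF_DIST], 0, 3);
        sum += process(data[i]);
    }
    return sum;
}
```

The bounds check avoids prefetching past the end of the array; that's harmless correctness-wise (prefetch can't fault), but it wastes an instruction slot.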
Keep in mind that memory-level parallelism is a thing (e.g. Skylake can have up to 12 cache lines in flight to/from L1d cache; out-of-order exec can get later loads started while older loads are still waiting). Also, HW prefetch into L2 and L1d cache will already be close to optimal for simple cases like this (sequential access); they'll have data for the later cache lines flowing into caches while process(data[i]) is happening. See How much of ‘What Every Programmer Should Know About Memory’ is still valid? - HW prefetch has come a long way since the original article was written. (But you should still read the original if you haven't.)
If your computational intensity is high enough (amount of work done per byte of load bandwidth into registers or into L1d cache), HW prefetch will simply keep up with what you're doing, e.g. by doing more work on each pass over your data while it's still in registers, or cache-blocking so you re-read data you used recently and get L1d hits.
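A minimal cache-blocking sketch (the two passes, the block size, and the function names are illustrative assumptions, not from the original): instead of streaming the whole array through each pass, run every pass over one L1d-sized chunk at a time, so the second pass re-reads data the first pass just touched:

```c
#include <stddef.h>

/* Two hypothetical passes over the data; in a naive version each pass
   streams the whole array, so pass2 always misses in L1d for large n. */
static void pass1(float *x, size_t n) { for (size_t i = 0; i < n; i++) x[i] += 1.0f; }
static void pass2(float *x, size_t n) { for (size_t i = 0; i < n; i++) x[i] *= 2.0f; }

/* Cache-blocked version: both passes over one block before moving on,
   so pass2 gets L1d hits on the data pass1 just wrote. */
void both_passes_blocked(float *data, size_t n)
{
    const size_t BLOCK = 4096; /* 16 KiB of floats; placeholder, tune per CPU */
    for (size_t start = 0; start < n; start += BLOCK) {
        size_t len = (n - start < BLOCK) ? n - start : BLOCK;
        pass1(data + start, len);
        pass2(data + start, len);
    }
}
```

The result is bit-identical to the unblocked version here because the passes are element-wise; blocking only changes the traversal order, i.e. the cache behavior.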
Tuning software prefetch is unfortunately difficult and system-specific, and can even depend on competition from other cores for memory bandwidth. It would be nice if there were a low-overhead way to make the prefetch distance dynamic like this, but I don't think there is. That's especially true for NT prefetch, which doesn't "pollute" L2 cache by leaving a copy there (or L3 cache on systems where it's not inclusive): if you prefetch too far ahead and the data is evicted before your code uses it, you get another cache miss all the way to L3 or DRAM.
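For reference (the distance and the accumulation loop are again placeholder assumptions): with GCC/Clang, passing 0 as the third (temporal-locality) argument of `__builtin_prefetch` requests a non-temporal prefetch, which compiles to `prefetchnta` on x86. This is the variant where over-shooting the distance hurts most, since there's no L2/L3 copy to fall back on:

```c
#include <stddef.h>

double sum_nt_prefetch(const double *data, size_t n)
{
    const size_t PF_DIST = 64; /* elements ahead; placeholder, must be tuned */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            /* locality 0 => non-temporal hint (prefetchnta on x86):
               minimal cache pollution, but if the line is evicted before
               use, the re-fetch goes all the way to L3 or DRAM. */
            __builtin_prefetch(&data[i + PF_DIST], 0, 0);
        sum += data[i];
    }
    return sum;
}
```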
There are PMU event counters you can use on the whole loop (most easily by isolating the loop into a whole program that is just a microbenchmark), like mem_load_retired.l1_miss (which exists on my Intel Skylake using Linux perf; available events and their names differ by microarchitecture.) If you're trying to prefetch early enough for data to be in L1d, you want to keep increasing prefetch distance until l1_miss counts drop.
Another relevant event is cycle_activity.stalls_l3_miss which counts cycles (or just starts of stalls, not every cycle of each stall?) that happen while the core is waiting for a load result that missed in L3. (A "demand" miss is one that isn't from the prefetchers.)
A stall means the ReOrder Buffer (ROB) filled, or some other back-end resource such as load-buffer entries ran out, so the front-end couldn't "issue" (non-Intel terminology: "dispatch") any more instructions / uops into the out-of-order back-end.

The ROB is a circular buffer that instructions issue into and retire from in program order. (The scheduler or schedulers are separate, and only track instructions that haven't yet executed; their entries can be freed out of order.)

Other independent work can still make progress during the stall (executing and becoming ready to retire), but one instruction that can't retire (e.g. a cache-miss load) blocks retirement, since retirement has to happen in program order for precise exceptions (rolling back to a consistent state at the faulting instruction). When instructions aren't leaving the ROB, it eventually fills, and then there's no room for new instructions to enter.
But one cache miss doesn't stall the whole core; one of the major reasons for doing out-of-order exec is to hide cache-miss latency.
> because modern CPUs need to efficiently switch threads when the pipeline stalls,
Modern CPUs with multiple logical cores per physical core use fine-grained SMT, not switch-on-stall. See Wikipedia and https://www.realworldtech.com/alpha-ev8-smt/ (a good article about the first implementation of SMT). Also, Modern Microprocessors: A 90-Minute Guide! has a section about SMT.
The front-end alternates cycles between logical cores unless one is stalled (either because its share of the back-end is full, or because it's stalled in the front-end, e.g. on an I-cache miss, or in the middle of branch-mispredict recovery). Anyway, a stall doesn't happen until the ROB (reorder buffer) is full, so individual cache misses aren't something the front-end would care about.
…prefetch and first use. With sequential memory access the hardware will always be caching ahead of you, so there is no point checking inside the loop. You can do a bit of useful work in the dead time before starting the loop, though. There is quite an interesting read/write asymmetry in access times; all other things being equal, memory writes should be made sequential access if at all possible. Random-access reads are more forgiving.