Nowadays, I doubt you could find a single processor being manufactured with an L4 cache.
If you have only a single level of cache, that is by convention your L1 cache; I don't see the point in calling it L4. Unless, that is, there is another underlying question from which this weird one originates (the XY problem).
Since there is a Line Fill Buffer (LFB) associated with the L1 cache, an entry would be allocated in the LFB to track this miss and to assemble the cache line as the data arrives.
The memory controller then forwards the request to the appropriate DRAM chip, which activates the row containing the data and then accesses the proper column.
This part of the procedure is responsible for most of the latency you mentioned.
Since the bus in your scenario transfers 16 bytes at a time, the burst transfer of the 64-byte line happens in 4 separate bus cycles (all part of one request/response transaction) from the DRAM to the processor.
This is not 4 different requests from the processor, rather one request for a cache line from the processor, and 4 separate chunks of data which are assembled later by the LFB and inserted into the L1. (Or assembled by the memory controller and sent over wider busses inside the CPU.)
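To make the arithmetic above concrete, here is a toy sketch (the 64-byte line and 16-byte bus are the numbers from your scenario; the "data" is just a stand-in) of the line fill split into beats and assembled in an LFB-like buffer:

```python
# Numbers from the question's scenario: 64-byte cache line, 16-byte bus.
CACHE_LINE_BYTES = 64
BUS_WIDTH_BYTES = 16

beats = CACHE_LINE_BYTES // BUS_WIDTH_BYTES  # bus cycles per line fill

# Toy model of the LFB entry assembling the line chunk by chunk.
line_fill_buffer = bytearray(CACHE_LINE_BYTES)
for beat in range(beats):
    chunk = bytes([beat] * BUS_WIDTH_BYTES)  # stand-in for data off the bus
    offset = beat * BUS_WIDTH_BYTES
    line_fill_buffer[offset:offset + BUS_WIDTH_BYTES] = chunk

print(beats)  # → 4
```

One request goes out; `beats` chunks come back and are merged into a single line before insertion into the L1.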
These transfers occur in "burst mode": after receiving the initial address, the DRAM automatically sends the sequential chunks, so the memory controller issues just a single command to retrieve the entire cache line. The reason is the high memory access latency (most of the 100 ns in your scenario), which needs to be amortized by transferring more than a single bus width of data per request, data which could later prove useful.
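A back-of-the-envelope calculation shows the amortization. The 100 ns figure is from your scenario; the 2.5 ns per-beat bus cycle time is an assumption I made up purely for illustration:

```python
# Amortizing the fixed access latency over a burst (assumed numbers).
INITIAL_LATENCY_NS = 100.0  # row activate + column access + first word (from the question)
BUS_CYCLE_NS = 2.5          # assumed time per 16-byte beat (hypothetical)
BEATS = 4                   # 64-byte line / 16-byte bus

single_beat_total = INITIAL_LATENCY_NS + BUS_CYCLE_NS          # fetch 16 B only
burst_total = INITIAL_LATENCY_NS + BEATS * BUS_CYCLE_NS        # fetch the whole line

per_byte_single = single_beat_total / 16
per_byte_burst = burst_total / 64

print(f"{per_byte_single:.2f} ns/B vs {per_byte_burst:.2f} ns/B")  # → 6.41 ns/B vs 1.72 ns/B
```

The fixed latency dominates either way, so pulling in the whole line costs barely more than pulling in one bus width, which is exactly why bursts pay off.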
It's not a coincidence that DDR SDRAM's burst size is 64 bytes (or 32 for a chopped burst), the same as the cache line size of typical CPUs. (DDR SDRAM's data width is only 64 bits, so your hypothetical system has a wider data path.)
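The match falls out of the DDR parameters. A sketch of the arithmetic, using the standard DDR3/DDR4 burst length of 8 and burst chop of 4:

```python
# DDR SDRAM burst arithmetic: a 64-bit (8-byte) data interface with
# burst length 8 delivers exactly one 64-byte cache line.
DDR_INTERFACE_BYTES = 8  # 64-bit data bus
BURST_LENGTH = 8         # standard DDR3/DDR4 burst length
BURST_CHOP = 4           # the shortened "burst chop" mode

print(DDR_INTERFACE_BYTES * BURST_LENGTH)  # → 64 bytes, one full cache line
print(DDR_INTERFACE_BYTES * BURST_CHOP)    # → 32 bytes, the short burst
```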