= SMR Layout Optimisation for XFS
Dave Chinner,
v0.2, March 2015

== Overview

This document describes a relatively simple way of modifying XFS, using existing on-disk structures, to be able to use host-managed SMR drives. This assumes that a userspace ZBC implementation such as libzbc will do all the heavy lifting of laying out the structure of the filesystem, and that it will perform things like zone write pointer checking/resetting before the filesystem is mounted.

== Concepts

SMR is architected as a set of sequentially written zones which allow neither out-of-order writes nor overwrites of data already written in the zone. Zones are typically in the order of 256MB, though they may actually be of variable size as the physical geometry of the drives differs from the inner to the outer edge.

SMR drives also typically have an outer section that uses CMR technology - it allows random writes and overwrites to any area within those zones. Drive-managed SMR devices use this region for internal metadata journalling, for block remapping tables, and as a staging area for data writes before they are written out sequentially into zones after block remapping has been performed.

Recent research has shown that 6TB Seagate drives have a 20-25GB CMR zone, which is more than enough for our purposes. Information from other vendors indicates that some drives will have much more CMR, hence if we design for the known sizes in the Seagate drives we will be fine for other drives just coming onto the market right now.

For host-managed/aware drives, we are going to assume that we can use this area directly for filesystem metadata - for our own mapping tables and things like the journal, inodes, directories and free space tracking. We are also going to assume that we can find these regions easily in the ZBC information, and that they are going to be contiguous rather than spread all over the drive.
XFS already has a data-only device called the "real time" device, whose free space information is tracked externally in bitmaps attached to inodes that exist in the "data" device. All filesystem metadata exists in the "data" device, except perhaps the journal, which can also be on an external device.

A key constraint we need to work within here is that RAID on SMR drives is a long way off. The main use case is bulk storage of data in the back end of distributed object stores (i.e. cat pictures on the intertubes), and hence a filesystem per drive is the typical configuration we'll be chasing here. Similarly, partitioning of SMR drives makes no sense for host-aware drives, so we are going to constrain the architecture to a single drive for now.

== Journal modifications

Because the XFS journal is a sequentially written circular log, we can actually use SMR zones for it - it does not need to be in the metadata region. This requires a small amount of additional complexity - we can't wrap the log as we currently do. Instead, we'll need to split the log across two zones so that we can push the tail into the same zone as the head, then reset the now unused zone, and then when the log wraps it can simply start again from the beginning of the erased zone.

Like a normal spinning disk, we'll want to place the log in a pair of zones near the middle of the drive so that we minimise the worst case seek cost of a log write to half of a full disk seek. There may be an advantage to putting it right next to the metadata zone, but typically metadata writes are not correlated with log writes.

Hence the only real functionality we need to add to the log is the tail pushing modifications to move the tail into the same zone as the head, as well as being able to trigger and block on zone write pointer reset operations. The log doesn't actually need to track the zone write pointer, though log recovery will need to limit the recovery head to the current write pointer of the head zone.
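To make the two-zone discipline concrete, here is a toy sketch of the invariants involved (all names invented, not proposed kernel code): the head may only start writing into a zone once the tail has been pushed out of it and its write pointer has been reset.

```python
class TwoZoneLog:
    """Toy model of a circular log split across two SMR zones."""

    def __init__(self, zone_size):
        self.zone_size = zone_size
        self.head = 0                       # next block to write
        self.tail = 0                       # oldest live block
        self.zone_written = [False, False]  # zone holds unreset data?

    def _zone_of(self, blk):
        return (blk // self.zone_size) % 2

    def append(self, nblocks):
        zone = self._zone_of(self.head)
        if self.head % self.zone_size == 0:
            # Starting a fresh zone: it must have been reset, which in
            # turn requires the tail to have been pushed out of it.
            assert not self.zone_written[zone], "zone not reset yet"
        end = (self.head // self.zone_size + 1) * self.zone_size
        assert self.head + nblocks <= end, "no cross-zone appends"
        self.zone_written[zone] = True
        self.head = (self.head + nblocks) % (2 * self.zone_size)

    def push_tail(self, new_tail):
        """Move the tail forward; reset a zone once it holds no live data."""
        old_zone = self._zone_of(self.tail)
        self.tail = new_tail
        if self._zone_of(new_tail) != old_zone:
            self.zone_written[old_zone] = False  # write pointer reset
```

The point of the sketch is the assertion in `append()`: it is exactly the tail-pushing modification described above, expressed as a precondition for wrapping.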
Modifications here are limited to the function that finds the head of the log, and can actually be used to speed up the search algorithm. However, given the size of the CMR zones, we can host the journal in an unmodified manner inside the CMR zone and not have to worry about zone awareness at all. This is by far the simplest solution to the problem.

== Data zones

What we need is a mechanism for tracking the location of zones (i.e. start LBA), free space/write pointers within each zone, and some way of keeping track of that information across mounts. If we assign a real time bitmap/summary inode pair to each zone, we have a method of tracking free space in the zone. We can use the existing bitmap allocator with a small tweak (sequentially ascending, packed extent allocation only) to ensure that newly written blocks are allocated in a sane manner.

We're going to need userspace to be able to see the contents of these inodes; read-only access will be needed to analyse the contents of the zone, so we're going to need a special directory to expose this information. It would be useful to have a ".zones" directory hanging off the root directory that contains all the zone allocation inodes so userspace can simply open them.

The biggest issue that has come to light here is the number of zones in a device. Zones are typically 256MB in size, so we are looking at 4,000 zones/TB. For a 10TB drive, that's 40,000 zones we have to keep track of. And if the devices keep getting larger at the expected rate, we're going to have to deal with zone counts in the hundreds of thousands. Hence a single flat directory containing all these inodes is not going to scale, nor will we be able to keep them all in memory at once. As a result, we are going to need to group the zones for locality and efficiency purposes, likely as "zone groups" of, say, up to 1TB in size.
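The allocator tweak described above - sequentially ascending, packed extent allocation only - can be sketched as follows. All names here are illustrative, not real XFS structures; the key behaviours are that allocation only ever happens at the zone's write pointer, and freeing only marks space stale.

```python
class ZoneAllocator:
    """Toy per-zone allocator: packed, ascending allocation only."""

    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.write_pointer = 0   # next allocatable block in the zone
        self.stale = set()       # blocks allocated and later freed

    def alloc(self, nblocks):
        """Allocate only at the write pointer, never behind it."""
        if self.write_pointer + nblocks > self.nblocks:
            return None          # zone full; caller must try another zone
        start = self.write_pointer
        self.write_pointer += nblocks
        return start

    def free(self, start, nblocks):
        # Freed space is merely stale - it cannot be reused until the
        # whole zone is stale and its write pointer has been reset.
        self.stale.update(range(start, start + nblocks))

    def resettable(self):
        """True once every written block in the zone is stale."""
        return len(self.stale) == self.write_pointer

    def stale_fraction(self):
        return len(self.stale) / max(self.write_pointer, 1)
```

`stale_fraction()` is the per-zone summary information that the cleaner policy described below would read from the /.zones/ inodes.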
Luckily, by keeping the zone information in inodes the information can be demand paged, so we don't need to pin thousands of inodes and bitmaps in memory. Zone groups also have other benefits...

While it seems like tracking free space is trivial for the purposes of allocation (and it is!), the complexity comes when we start to delete or overwrite data. Suddenly zones no longer contain contiguous ranges of valid data; they have "freed" extents in the middle of them that contain stale data. We can't reuse that "stale space" until the entire zone is made up of "stale" extents. Hence we need a Cleaner.

=== Zone Cleaner

The purpose of the cleaner is to find zones that are mostly stale space and consolidate the remaining referenced data into a new, contiguous zone, enabling us to then "clean" the stale zone and make it available for writing new data again.

The real complexity here is finding the owner of the data that needs to be moved, but we are in the process of solving that with the reverse mapping btree and parent pointer functionality. This gives us the mechanism by which we can quickly re-organise files that have extents in zones that need cleaning.

The key word here is "reorganise". We have a tool that already reorganises file layout: xfs_fsr. The "Cleaner" is a finely targeted policy for xfs_fsr - instead of trying to minimise file fragmentation, it finds zones that need cleaning by reading their summary info from the /.zones/ directory and analysing the free bitmap state to see if there is a high enough percentage of stale blocks. From there we can use the reverse mapping to find the inodes that own the extents in those zones. And from there, we can run the existing defrag code to rewrite the data in the file, thereby marking all the old blocks stale. This will make mostly-stale zones entirely stale, and hence they can then be reset.
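The cleaner policy just described boils down to two steps: select victim zones from their summary info, then resolve owners via the reverse mapping. A rough sketch, with a stubbed rmap lookup and a made-up threshold (none of these names are real interfaces):

```python
# Hypothetical threshold: clean zones that are at least 75% stale.
STALE_THRESHOLD = 0.75

def select_zones(zone_summaries):
    """Pick victim zones from per-zone summaries.

    zone_summaries: {zone_id: (written_blocks, stale_blocks)}, i.e. the
    kind of data xfs_fsr would read from the /.zones/ inodes.
    """
    victims = []
    for zone_id, (written, stale) in zone_summaries.items():
        if written and stale / written >= STALE_THRESHOLD:
            victims.append(zone_id)
    return sorted(victims)

def owners_to_defrag(victims, rmap_lookup):
    """Collect the inodes owning live extents in the victim zones.

    rmap_lookup(zone_id) stands in for a reverse-mapping query and
    yields (inode, extent) pairs; the real mechanism is the rmap btree.
    """
    inodes = set()
    for zone_id in victims:
        for inode, _extent in rmap_lookup(zone_id):
            inodes.add(inode)
    return inodes
```

Rewriting those inodes with the existing defrag code then marks the remaining live blocks in the victim zones stale, making the zones resettable.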
Hence we don't actually need any major new data-moving functionality in the kernel to enable this, except perhaps an event channel for the kernel to tell xfs_fsr it needs to do some cleaning work.

If we arrange zones into zone groups, we also have a method for keeping new allocations out of regions we are re-organising. That is, we need to be able to mark zone groups as "read only" so the kernel will not attempt to allocate from them while the cleaner is running and re-organising the data within the zones in a zone group. This zone-group separation also allows the cleaner to maintain some level of locality to the data that it is re-arranging.

=== Reverse mapping btrees

One of the complexities is that the current reverse map btree is a per-allocation-group construct. This means that, as per the current design and implementation, it will not work with the inode-based bitmap allocator. This, however, is not actually a major problem thanks to the generic btree library that XFS uses. That is, the generic btree library in XFS is already used to implement the block mapping btree held in the data fork of the inode. Hence we can use the same btree implementation as the per-AG rmap btree, but simply add a couple of functions, set a couple of flags and host it in the inode data fork of a third per-zone inode to track the zone's owner information.

== Mkfs

Mkfs is going to have to integrate with the userspace ZBC libraries to query the layout of zones from the underlying disk and then do some magic to lay out all the necessary metadata correctly. I don't see there being any significant challenge to doing this, but we will need a stable libzbc API to work with and it will need to be packaged by distros.

If mkfs cannot find enough random write space for the amount of metadata we need to track all the space in the sequential write zones and a decent amount of internal filesystem metadata (inodes, etc.) then it will need to fail.
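That space check can be sketched using the rough figures from the quantification section below (~10kB of zone-tracking metadata per 256MB zone, 512MB/TB for inodes, 256MB/TB for directories, with headroom for the rest). All constants and names here are illustrative back-of-the-envelope numbers, not a real mkfs.xfs design:

```python
TB = 1024 ** 4
ZONE_SIZE = 256 * 1024 * 1024
PER_ZONE_METADATA = 10 * 1024          # bitmap + inode pair per zone
INODE_SPACE_PER_TB = 512 * 1024 ** 2   # ~1 million inodes per TB
DIR_SPACE_PER_TB = 256 * 1024 ** 2     # directory structures
SLACK = 2                              # headroom for rmap/attr/btree space

def required_random_write_space(smr_bytes):
    """Estimate the CMR space needed to track smr_bytes of zoned space."""
    nzones = smr_bytes // ZONE_SIZE
    tbs = smr_bytes / TB
    need = (nzones * PER_ZONE_METADATA
            + tbs * (INODE_SPACE_PER_TB + DIR_SPACE_PER_TB))
    return int(need * SLACK)

def mkfs_check(cmr_bytes, smr_bytes):
    """mkfs must fail if the random write region is too small."""
    return cmr_bytes >= required_random_write_space(smr_bytes)
```

For a 10TB drive this works out to somewhere under 20GB, which is consistent with the ~2GB per TB figure below and the known Seagate CMR region sizes.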
Drive vendors are going to need to provide sufficient space in these regions for us to be able to make use of it, otherwise we'll simply not be able to do what we need to do.

mkfs will need to initialise all the zone allocation inodes, reset all the zone write pointers, create the /.zones directory, place the log in an appropriate place, and initialise the metadata device as well.

== Repair

Because we've limited the metadata to a section of the drive that can be overwritten, we don't have to make significant changes to xfs_repair. It will need to be taught about the multiple zone allocation bitmaps for its space reference checking, but otherwise all the infrastructure we need for using bitmaps to verify used space should already be there.

There be dragons waiting for us if we don't have random write zones for metadata. If that happens, we cannot repair metadata in place and we will have to redesign xfs_repair from the ground up to support such functionality. That's just not going to happen, so we'll need drives with a significant amount of random write space for all our metadata...

== Quantification of Random Write Zone Capacity

A basic guideline is that for 4k blocks and zones of 256MB, we'll need 8kB of bitmap space and two inodes, so call it 10kB per 256MB zone. That's 40MB per TB for free space bitmaps. We'll want to support at least 1 million inodes per TB, so that's another 512MB per TB, plus another 256MB per TB for directory structures. There are other bits and pieces of metadata as well (attribute space, internal freespace btrees, reverse map btrees, etc.).

So, at minimum we will probably need at least 2GB of random write space per TB of SMR zone data space, plus a couple of GB for the journal if we want the easy option. For those drive vendors out there that are listening and want good performance: replace the CMR region with an SSD....

== Kernel implementation

The allocator will need to learn about multiple allocation zones based on bitmaps.
They aren't really allocation groups, but the initialisation and iteration of them is going to be similar to that of allocation groups. To get us going we can do some simple mapping between inode AG and data AZ so that we keep some form of locality for related data (e.g. grouping of data by parent directory).

We can do simple things first - simply rotoring allocation across zones will get us moving very quickly, and then we can refine it once we have more than just a proof-of-concept prototype. Optimising data allocation for SMR is going to be tricky, and I hope to be able to leave that to drive vendor engineers....

Ideally, we won't need a ZBC interface in the kernel, except to erase zones. I'd like to see an interface that doesn't even require that. For example, we issue a discard (TRIM) on an entire zone and that erases it and resets the write pointer. This way we need no new infrastructure at the filesystem layer to implement SMR awareness. In effect, the kernel isn't even aware that it's an SMR drive underneath it.

== Problem cases

There are a few elephants in the room.

=== Concurrent writes

What happens when an application does concurrent writes into a file (either by threads or AIO), and allocation happens in the opposite order to the IO being dispatched? i.e., with a zone write pointer at block X, this happens:

----
Task A                  Task B
write N
                        write N + 1
allocate X
                        allocate X + 1
                        submit_bio
submit_bio
                        IO to block X+1 dispatched
----

And so even though we allocated the blocks in incoming IO order, the dispatch order was different. I don't see how the filesystem can prevent this from occurring, except to completely serialise IO to the zone - i.e. while we have a block allocation with no write completion, no other allocations to that zone can take place. If that's the case, this is going to cause massive fragmentation and/or severe IO latency problems for any application that has this sort of IO engine.
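One mitigation that avoids full serialisation is to requeue writes that arrive ahead of the zone's write pointer until the missing write turns up. A toy sketch of that requeue policy for a single zone (all names invented):

```python
from collections import deque

def dispatch_in_order(write_pointer, queue):
    """Dispatch IOs to a zone strictly at the write pointer.

    queue: deque of (block, payload) pairs in arrival order. An IO whose
    block is ahead of the write pointer is requeued at the tail; it is
    dispatched once the intervening writes have arrived. Returns the
    dispatch order and any IOs that could never be dispatched.
    """
    dispatched = []
    q = deque(queue)
    stalls = 0
    while q and stalls < len(q):
        block, payload = q.popleft()
        if block == write_pointer:
            dispatched.append((block, payload))
            write_pointer += 1
            stalls = 0
        else:
            q.append((block, payload))   # out of order: requeue at tail
            stalls += 1
    return dispatched, list(q)
```

The stall counter guards against an IO that never arrives (e.g. a failed write), in which case the remaining queue would have to be errored out rather than spun on forever.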
There is a block layer solution to this in the works - the block layer will track the write pointer in each zone and if it gets writes out of order it will requeue the IO at the tail of the queue, hence allowing the IO that has been delayed to be issued before the out-of-order write.

=== Crash recovery

The write pointer location is undefined after power failure. It could be at an old location, the current location or anywhere in between. The only guarantee we have is that if we flushed the cache (i.e. fsync'd a file) then the write pointers will at least be at or past the location of the fsync.

Hence before a filesystem runs journal recovery, all its zone allocation write pointers need to be set to what the drive thinks they are, and all of the zone allocation state beyond the write pointer needs to be cleared. We could do this during log recovery in the kernel, but that means we need full ZBC awareness in log recovery to iterate and query all the zones. Hence it's not clear if we want to do this in userspace, as that has its own problems: e.g. we'd need to have fsck.xfs detect that it's an SMR filesystem and perform that recovery, or write a mount.xfs helper that does it prior to mounting the filesystem. Either way, we need to synchronise the on-disk filesystem state to the internal disk zone state before doing anything else.

This needs more thought, because I have a nagging suspicion that we need to do this write pointer resynchronisation *after log recovery* has completed, so we can determine whether we've now got to go and free extents that the filesystem has allocated and that are referenced by some inode out there. This, again, will require reverse mapping lookups to solve.

=== Preallocation Issues

Because we can only do sequential writes, we can only allocate space that exactly matches the write being performed. That means we *cannot preallocate extents*. The reason for this is that preallocation will physically separate the data write location from the zone write pointer. e.g.
if we use preallocation to allocate space that we are about to do random writes into (to prevent fragmentation). We cannot do this on ZBC drives; we have to allocate specifically for the IO we are going to perform. As a result, we lose almost all the existing mechanisms we use for preventing fragmentation. Speculative EOF preallocation with delayed allocation cannot be used, fallocate cannot be used to preallocate physical extents, and extent size hints cannot be used because they do "allocate around" writes.

We're trying to do better without much investment in time and resources here, so the compromise is that we are going to have to rely on xfs_fsr to clean up fragmentation after the fact. Luckily, the other function we need from xfs_fsr (zone cleaning) also acts to defragment free space, so we don't have to worry about trading contiguous files for fragmented free space and the downward spiral that follows. I suspect the best we will be able to do with fallocate-based preallocation is to mark the region as delayed allocation.

=== Allocation Alignment

With zone-based write pointers, we lose all capability of write alignment to the underlying storage - our only choice of write location is the current set of write pointers we have access to. There are several methods we could use to work around this problem (e.g. put a slab-like allocator on top of the zones) but that requires completely redesigning the allocators for SMR. Again, this may be a step too far....

=== RAID on SMR....

How does RAID work with SMR, and exactly what does that look like to the filesystem? How does libzbc work with RAID given that it is implemented through the SCSI ioctl interface? How does RAID repair parity errors in place? Or does the RAID layer now need a remapping layer so the LBAs of rewritten stripes remain the same? Indeed, how do we handle partial stripe writes, which will require multiple parity block writes? What does the geometry look like (stripe unit, width) and what does the write pointer look like?
How does RAID track all the necessary write pointers and keep them in sync? What about RAID1, with its dirty region logging to minimise resync time and overhead?