[[Real-time_Devices]]
= Real-time Devices

The performance of the standard XFS allocator varies depending on the internal
state of the various metadata indices enabled on the filesystem.  For
applications which need to minimize the jitter of allocation latency, XFS
supports the notion of a ``real-time device''.  This is a special device
separate from the regular filesystem where extent allocations are tracked with
a bitmap and free space is indexed with a two-dimensional array.  If an inode
is flagged with +XFS_DIFLAG_REALTIME+, its data will live on the real time
device.

By placing the real time device (and the journal) on separate high-performance
storage devices, it is possible to reduce most of the unpredictability in I/O
response times that comes from metadata operations.

None of the XFS per-AG B+trees are involved with real time files.

[[Real-Time_Bitmap_Inode]]
== Free Space Bitmap Inode

The real time bitmap inode, +sb_rbmino+, tracks the used/free space in the
real-time device using an old-style bitmap. One bit is allocated per real-time
extent. The size of an extent is specified by the superblock's +sb_rextsize+
value.

The number of blocks used by the bitmap inode is equal to the number of
real-time extents (+sb_rextents+) divided by the number of bits in a block,
that is, the block size (+sb_blocksize+) times the bits per byte. This value is
stored in +sb_rbmblocks+. The nblocks and extent array for the inode should
match this.  Each real-time extent gets its own bit in the bitmap.
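
The block count described above can be sketched as a round-up division.  The
helper name +rtbitmap_blocks+ is hypothetical, and the sketch assumes a layout
in which every byte of a bitmap block holds bitmap data (no per-block header)
and that a partially filled last block is counted as a whole block.

```c
#include <assert.h>
#include <stdint.h>

/* Bits per byte. */
#define BITS_PER_BYTE 8

/*
 * Hypothetical helper: number of filesystem blocks needed to hold one bit
 * per realtime extent, rounding up so a partial last block is counted.
 */
static uint64_t rtbitmap_blocks(uint64_t rextents, uint32_t blocksize)
{
	uint64_t bits_per_block = (uint64_t)blocksize * BITS_PER_BYTE;

	assert(blocksize > 0);
	return (rextents + bits_per_block - 1) / bits_per_block;
}
```

Under these assumptions, 5192704 realtime extents with 4096-byte blocks would
need 159 bitmap blocks.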

If the +XFS_SB_FEAT_INCOMPAT_METADIR+ feature is enabled, each block of the
realtime bitmap file has a header of the following format:

[source, c]
----
struct xfs_rtbuf_blkinfo {
	__be32		rt_magic;
	__be32		rt_crc;
	__be64		rt_owner;
	__be64		rt_blkno;
	__be64		rt_lsn;
	uuid_t		rt_uuid;
};
----

*rt_magic*::
Specifies the magic number for the rtbitmap block: ``BMPZ'' (0x424D505A).

*rt_crc*::
Checksum of the block.

*rt_owner*::
Specifies the inode number for the file that owns this block.

*rt_blkno*::
Disk address of this block.

*rt_lsn*::
Log sequence number of the last write to this block.

*rt_uuid*::
The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
depending on which features are set.

After the block header, the bitmap data are encoded as be32 word values.
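
As a rough sketch of how this layout might be read, the following code mirrors
the header with fixed-width types and tests the bit for one extent.  The struct
and helper names are hypothetical.  The sketch assumes a 16-byte UUID, that bit
0 of each word is its least significant bit, and that a set bit marks a free
extent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of struct xfs_rtbuf_blkinfo using fixed-width types. */
struct rtbuf_blkinfo {
	uint32_t rt_magic;
	uint32_t rt_crc;
	uint64_t rt_owner;
	uint64_t rt_blkno;
	uint64_t rt_lsn;
	uint8_t  rt_uuid[16];	/* assumed size of uuid_t */
};

static_assert(sizeof(struct rtbuf_blkinfo) == 48, "unexpected header padding");

/* Number of 32-bit bitmap words that fit after the header. */
static size_t rtwords_per_block(size_t blocksize)
{
	return (blocksize - sizeof(struct rtbuf_blkinfo)) / sizeof(uint32_t);
}

/* Decode a big-endian 32-bit value from raw bytes. */
static uint32_t be32_decode(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Return 1 if extent number rtx (counted from the first bit of this block)
 * is free.  Assumes set bit == free and LSB-first bit order within a word.
 */
static int rtx_is_free(const uint8_t *block, size_t blocksize, size_t rtx)
{
	size_t word = rtx / 32;
	const uint8_t *p;

	assert(word < rtwords_per_block(blocksize));
	p = block + sizeof(struct rtbuf_blkinfo) + word * sizeof(uint32_t);
	return (be32_decode(p) >> (rtx % 32)) & 1;
}
```

With 4096-byte blocks this gives 1012 words per block, which agrees with the
word indices 0 to 1011 that xfs_db prints for an rtbitmap block.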

=== xfs_db rtbitmap Example

This example shows a real-time bitmap file from a freshly populated filesystem:

----
xfs_db> path -m /rtgroups/3.bitmap
xfs_db> p
core.magic = 0x494e
core.mode = 0100000
core.version = 3
core.format = 2 (extents)
core.metatype = 5 (rtbitmap)
core.uid = 0
core.gid = 0
core.nlinkv2 = 1
core.projid_lo = 3
core.projid_hi = 0
core.nextents = 1
core.atime.sec = Tue Oct 15 16:04:02 2024
core.atime.nsec = 769675000
core.mtime.sec = Tue Oct 15 16:04:02 2024
core.mtime.nsec = 769675000
core.ctime.sec = Tue Oct 15 16:04:02 2024
core.ctime.nsec = 769681000
core.size = 135168
core.nblocks = 33
core.extsize = 0
core.naextents = 0
core.forkoff = 24
core.aformat = 1 (local)
core.dmevmask = 0
core.dmstate = 0
core.newrtbm = 0
core.prealloc = 0
core.realtime = 0
core.immutable = 1
core.append = 0
core.sync = 1
core.noatime = 1
core.nodump = 1
core.rtinherit = 0
core.projinherit = 0
core.nosymlinks = 0
core.extsz = 0
core.extszinherit = 0
core.nodefrag = 1
core.filestream = 0
core.gen = 2653591217
next_unlinked = null
v3.crc = 0x34a17119 (correct)
v3.change_count = 3
v3.lsn = 0
v3.flags2 = 0x38
v3.cowextsize = 0
v3.crtime.sec = Tue Oct 15 16:04:02 2024
v3.crtime.nsec = 769675000
v3.inumber = 33685633
v3.uuid = a6575f59-1514-445e-883e-211b2c5a0f05
v3.reflink = 0
v3.cowextsz = 0
v3.dax = 0
v3.bigtime = 1
v3.nrext64 = 1
v3.metadata = 1
u3.bmx[0] = [startoff,startblock,blockcount,extentflag] 
0:[0,4210712,33,0]
a.sfattr.hdr.totsize = 27
a.sfattr.hdr.count = 1
a.sfattr.list[0].namelen = 8
a.sfattr.list[0].valuelen = 12
a.sfattr.list[0].root = 0
a.sfattr.list[0].secure = 0
a.sfattr.list[0].parent = 1
a.sfattr.list[0].name = "0.bitmap"
a.sfattr.list[0].parent_dir.inumber = 33685632
a.sfattr.list[0].parent_dir.gen = 142228546
xfs_db> dblock 0
xfs_db> p
magicnum = 0x424d505a
crc = 0xc8b10abf (correct)
owner = 33685633
bno = 20902080
lsn = 0x100007696
uuid = a6575f59-1514-445e-883e-211b2c5a0f05
rtwords[0-1011] = 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0
14:0 15:0 16:0 17:0 18:0 19:0 20:0 21:0xfffff800 22:0xffffffff 23:0xffffffff
24:0xffffffff 25:0xffffffff 26:0xffffffff 27:0xffffffff 28:0xffffffff
29:0xffffffff 30:0xffffffff 31:0xffffffff 32:0xffffffff
...
979:0xffffffff 980:0xffffffff 981:0xffffffff 982:0xffffffff 983:0xffffffff
984:0xffffffff 985:0xffffffff 986:0xffffffff 987:0xffffffff 988:0xffffffff
989:0xffffffff 990:0xffffffff 991:0xffffffff 992:0xffffffff 993:0xffffffff
994:0xffffffff 995:0xffffffff 996:0xffffffff 997:0xffffffff 998:0xffffffff
999:0xffffffff 1000:0xffffffff 1001:0xffffffff 1002:0xffffffff 1003:0xffffffff
1004:0xffffffff 1005:0xffffffff 1006:0xffffffff 1007:0xffffffff 1008:0xffffffff
1009:0xffffffff 1010:0xffffffff 1011:0xffffffff
----

From this example, we can see that this is a bitmap file in the
metadata directory tree, and that it is the bitmap file for rtgroup 3.  The
first block of the bitmap file starts with the new block header.  Words 0
through 20 are zero and the low 11 bits of word 21 are clear, so the first 683
extents are allocated.  The bitmap words were excerpted for brevity.

[[Real-Time_Summary_Inode]]
== Free Space Summary Inode

The real time summary inode, +sb_rsumino+, tracks the used and free space
accounting information for the real-time device.  This file indexes the
approximate location of each free extent on the real-time device first by
log2(extent size) and then by the real-time bitmap block number.  The size of
the summary inode file is equal to +sb_rbmblocks+ × log2(realtime device size)
× sizeof(+xfs_suminfo_t+).  The entry for a given log2(extent size) and
rtbitmap block number is 0 if there are no free extents of that size at that
rtbitmap location, and positive if there are any.

This data structure is not particularly space efficient; however, it is a very
fast way to provide the same data as the two free space B+trees for regular
files since the space is preallocated and metadata maintenance is minimal.
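
A minimal sketch of the indexing described above, assuming the summary is laid
out as one row of counters per log2(extent size) class, with one counter per
rtbitmap block in each row.  The helper names and the +levels+ parameter (the
number of log2 size classes) are hypothetical, and +xfs_suminfo_t+ is assumed
to be a 32-bit counter.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match xfs_suminfo_t: one 32-bit counter per entry. */
typedef uint32_t suminfo_t;

/* Index of the counter for size class log2len and rtbitmap block bbno. */
static uint64_t rtsummary_index(uint32_t log2len, uint64_t bbno,
				uint64_t rbmblocks)
{
	assert(bbno < rbmblocks);
	return (uint64_t)log2len * rbmblocks + bbno;
}

/* Bytes of counter data for the given number of size classes. */
static uint64_t rtsummary_bytes(uint64_t rbmblocks, uint32_t levels)
{
	return rbmblocks * levels * sizeof(suminfo_t);
}
```

Under these assumptions, a volume with 33 bitmap blocks keeps the counter for
size class 19 and bitmap block 0 at index 627, and 21 size classes would occupy
2772 bytes of counters.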

If the +XFS_SB_FEAT_INCOMPAT_METADIR+ feature is enabled, each block of the
realtime summary file has the same header as rtbitmap file blocks.  However,
the magic number will be ``SUMY'' (0x53554D59).  After the block header, the
summary counts are encoded as be32 integers.

=== xfs_db rtsummary Example

This example shows a real-time summary file from a freshly populated filesystem:

----
xfs_db> path -m /rtgroups/3.summary
xfs_db> p
core.magic = 0x494e
core.mode = 0100000
core.version = 3
core.format = 2 (extents)
core.metatype = 6 (rtsummary)
core.uid = 0
core.gid = 0
core.nlinkv2 = 1
core.projid_lo = 3
core.projid_hi = 0
core.nextents = 1
core.atime.sec = Tue Oct 15 16:04:02 2024
core.atime.nsec = 769694000
core.mtime.sec = Tue Oct 15 16:04:02 2024
core.mtime.nsec = 769694000
core.ctime.sec = Tue Oct 15 16:04:02 2024
core.ctime.nsec = 769699000
core.size = 4096
core.nblocks = 1
core.extsize = 0
core.naextents = 0
core.forkoff = 24
core.aformat = 1 (local)
core.dmevmask = 0
core.dmstate = 0
core.newrtbm = 0
core.prealloc = 0
core.realtime = 0
core.immutable = 1
core.append = 0
core.sync = 1
core.noatime = 1
core.nodump = 1
core.rtinherit = 0
core.projinherit = 0
core.nosymlinks = 0
core.extsz = 0
core.extszinherit = 0
core.nodefrag = 1
core.filestream = 0
core.gen = 519466891
next_unlinked = null
v3.crc = 0x54fc58d0 (correct)
v3.change_count = 3
v3.lsn = 0
v3.flags2 = 0x38
v3.cowextsize = 0
v3.crtime.sec = Tue Oct 15 16:04:02 2024
v3.crtime.nsec = 769694000
v3.inumber = 33685634
v3.uuid = a6575f59-1514-445e-883e-211b2c5a0f05
v3.reflink = 0
v3.cowextsz = 0
v3.dax = 0
v3.bigtime = 1
v3.nrext64 = 1
v3.metadata = 1
u3.bmx[0] = [startoff,startblock,blockcount,extentflag] 
0:[0,4210703,1,0]
a.sfattr.hdr.totsize = 28
a.sfattr.hdr.count = 1
a.sfattr.list[0].namelen = 9
a.sfattr.list[0].valuelen = 12
a.sfattr.list[0].root = 0
a.sfattr.list[0].secure = 0
a.sfattr.list[0].parent = 1
a.sfattr.list[0].name = "0.summary"
a.sfattr.list[0].parent_dir.inumber = 33685632
a.sfattr.list[0].parent_dir.gen = 142228546
xfs_db> dblock 0
xfs_db> p
magicnum = 0x53554d59
crc = 0x473340a8 (correct)
owner = 33685634
bno = 20902008
lsn = 0x100007696
uuid = a6575f59-1514-445e-883e-211b2c5a0f05
suminfo[0-1011] = 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0
14:0 15:0 16:0 17:0 18:0 19:0 20:0 21:0 22:0 23:0 24:0 25:0 26:0 27:0 28:0 29:0
30:0 31:0 32:0
...
618:0 619:0 620:0 621:0 622:0 623:0 624:0 625:0 626:0 627:1 628:0 629:0 630:0
...
979:0 980:0 981:0 982:0 983:0 984:0 985:0 986:0 987:0 988:0 989:0 990:0 991:0
992:0 993:0 994:0 995:0 996:0 997:0 998:0 999:0 1000:0 1001:0 1002:0 1003:0
1004:0 1005:0 1006:0 1007:0 1008:0 1009:0 1010:0 1011:0
----

From this example, we can see that this is a summary file in the
metadata directory tree, and that it is the summary file for rtgroup 3.  The
first block of the summary file starts with the new block header, and the single
nonzero counter records the one large free extent in this group.
The summary counts were excerpted for brevity.
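
To illustrate how a nonzero counter such as entry 627 could be interpreted, the
sketch below splits a flat counter index into a size class and an rtbitmap
block.  It assumes the counters are ordered by log2(extent size) first and
bitmap block second, and that the 33 blocks of this group's bitmap file serve
as the bitmap block count.  The struct and helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct rtsummary_slot {
	uint32_t log2len;	/* size class: floor(log2(free extent length)) */
	uint64_t bbno;		/* rtbitmap block number for this counter */
};

/* Split a flat counter index back into its two coordinates. */
static struct rtsummary_slot rtsummary_decode(uint64_t index,
					      uint64_t rbmblocks)
{
	struct rtsummary_slot slot;

	assert(rbmblocks > 0);
	slot.log2len = (uint32_t)(index / rbmblocks);
	slot.bbno = index % rbmblocks;
	return slot;
}

/* Smallest extent length, in realtime extents, covered by a size class. */
static uint64_t rtsummary_min_len(uint32_t log2len)
{
	assert(log2len < 64);
	return UINT64_C(1) << log2len;
}
```

Under these assumptions, entry 627 decodes to size class 19 and bitmap block 0,
that is, a free extent of at least 524288 realtime extents located at the first
bitmap block.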

[[Realtime_Groups]]
== Realtime Groups

To reduce metadata contention for space allocation and remapping activities
being applied to realtime files, the realtime volume can be split into
allocation groups, just like the data volume.  The free space information is
still contained in a single file that applies to the entire volume.  This
sharding enables code reuse between the data and realtime reverse mapping
indexes and supports parallelism of reverse mapping and online fsck activities.

Each realtime allocation group can contain up to (2^31^ - 1) filesystem blocks,
regardless of the underlying realtime extent size.
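
As an illustration of the size limit, the sketch below checks a proposed group
size and derives the number of groups.  The helper names are hypothetical, and
counting a partially filled last group as a group is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on the size of one realtime group, in filesystem blocks. */
#define RTGROUP_MAX_BLOCKS ((UINT64_C(1) << 31) - 1)

/*
 * Return 1 if a group of rgextents realtime extents, each rextsize
 * filesystem blocks long, stays within the per-group block limit.
 */
static int rtgroup_size_ok(uint64_t rgextents, uint32_t rextsize)
{
	assert(rextsize > 0);
	return rgextents > 0 && rgextents <= RTGROUP_MAX_BLOCKS / rextsize;
}

/* Number of groups needed to cover rextents realtime extents. */
static uint64_t rtgroup_count(uint64_t rextents, uint64_t rgextents)
{
	assert(rgextents > 0);
	return (rextents + rgextents - 1) / rgextents;
}
```

For example, 5192704 realtime extents split into groups of 1048576 extents
would yield five groups under this rounding assumption, the last one only
partially filled.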

Each realtime group has the following characteristics:

         * Group 0 has a super block describing overall filesystem info
         * Free space bitmap
         * Summary of free space
         * Reverse space mapping btree
         * Reference count btree

The free space metadata are the same as described in the previous sections,
except that their scope covers only a single rtgroup.  The other structures are
expanded upon in the following sections.

[[Realtime_Group_Superblocks]]
=== Superblocks

The first block of realtime group 0 contains a superblock.  The fields of this
superblock must match their counterparts in the filesystem superblock on the
data device.

[source, c]
----
struct xfs_rtsb {
	__be32		rsb_magicnum;
	__le32		rsb_crc;

	__be32		rsb_pad;
	unsigned char	rsb_fname[XFSLABEL_MAX];

	uuid_t		rsb_uuid;
	uuid_t		rsb_meta_uuid;

	/* must be padded to 64 bit alignment */
};
----

*rsb_magicnum*::
Identifies the filesystem. Its value is +XFS_RTSB_MAGIC+ ``Frog'' (0x46726F67).

*rsb_crc*::
Superblock checksum.

*rsb_pad*::
Must be zero.

*rsb_fname[12]*::
Name for the filesystem.  This matches +sb_fname+ in the primary superblock.

*rsb_uuid*::
UUID (Universally Unique ID) for the filesystem.  This matches +sb_uuid+ in the
primary superblock.

*rsb_meta_uuid*::
Metadata UUID for the filesystem.  This matches +sb_meta_uuid+ in the primary
superblock.
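
A rough sketch of how the fields above might be checked against the primary
superblock.  The struct mirrors +xfs_rtsb+ as raw bytes so that the big-endian
fields can be decoded portably.  The helper names are hypothetical,
+XFSLABEL_MAX+ is taken to be 12 as in the +rsb_fname[12]+ description, a UUID
is assumed to be 16 bytes, and checksum verification is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RTSB_MAGIC	0x46726F67u	/* "Frog" */
#define LABEL_MAX	12		/* stands in for XFSLABEL_MAX */

struct rtsb {
	uint8_t rsb_magicnum[4];
	uint8_t rsb_crc[4];		/* not verified in this sketch */
	uint8_t rsb_pad[4];
	uint8_t rsb_fname[LABEL_MAX];
	uint8_t rsb_uuid[16];
	uint8_t rsb_meta_uuid[16];
};

/* The structure must end on a 64-bit boundary. */
static_assert(sizeof(struct rtsb) == 56, "unexpected rtsb size");

/* Decode a big-endian 32-bit value from raw bytes. */
static uint32_t be32_decode(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Return 1 if the rt superblock agrees with the primary superblock fields. */
static int rtsb_matches(const struct rtsb *rsb, const uint8_t *sb_fname,
			const uint8_t *sb_uuid, const uint8_t *sb_meta_uuid)
{
	return be32_decode(rsb->rsb_magicnum) == RTSB_MAGIC &&
	       be32_decode(rsb->rsb_pad) == 0 &&
	       memcmp(rsb->rsb_fname, sb_fname, LABEL_MAX) == 0 &&
	       memcmp(rsb->rsb_uuid, sb_uuid, 16) == 0 &&
	       memcmp(rsb->rsb_meta_uuid, sb_meta_uuid, 16) == 0;
}
```

A real verifier would also validate +rsb_crc+ before trusting any of the other
fields.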

==== xfs_db rtgroup Superblock Example

A filesystem is made on a multidisk system with the following command:

----
# mkfs.xfs -r rtgroups=1,rgcount=4,rtdev=/dev/sdb /dev/sda -f
meta-data=/dev/sda               isize=512    agcount=4, agsize=1298176 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
         =                       metadir=1
data     =                       bsize=4096   blocks=5192704, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=16384, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =/dev/sdb               extsz=4096   blocks=5192704, rtextents=5192704
         =                       rgcount=5    rgsize=1048576 extents
----

And in xfs_db, inspecting the realtime group superblock:

----
# xfs_db -R /dev/sdb /dev/sda
xfs_db> rtsb
xfs_db> print
magicnum = 0x46726f67
crc = 0x759a62d4 (correct)
pad = 0
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
uuid = 7e55b909-8728-4d69-a1fa-891427314eea
meta_uuid = 7e55b909-8728-4d69-a1fa-891427314eea
----

include::rtrmapbt.asciidoc[]

include::rtrefcountbt.asciidoc[]

[[Zoned]]
== Zoned Real-time Devices

If the +XFS_SB_FEAT_INCOMPAT_ZONED+ feature is enabled, the real time device
uses an entirely different space allocator.  This feature does not use the
xref:Real-Time_Bitmap_Inode[Free Space Bitmap Inode] and
xref:Real-Time_Summary_Inode[Free Space Summary Inode].
Instead, writes to the storage hardware must always occur sequentially
from the start to the end of an rtgroup.  To support this requirement,
file data are always written out of place using the so-called copy on write
or COW write path (which actually just redirects on write and never copies).

When an rtgroup runs out of space to write, free space is reclaimed by
copying and remapping still-valid data from the full rtgroup into
another rtgroup.  Once the full rtgroup has been emptied, it is written to
from the beginning again.  For this, the
xref:Real_time_Reverse_Mapping_Btree[Reverse-Mapping B+tree] is required.
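
The write and reclaim cycle described above can be modeled with a toy write
pointer per rtgroup.  This is only an illustrative model under simplifying
assumptions (one counter of still-valid blocks per group, no mapping updates,
no I/O).  The struct and function names are hypothetical and do not correspond
to kernel code.

```c
#include <assert.h>
#include <stdint.h>

struct zone_model {
	uint32_t capacity;	/* blocks in this rtgroup */
	uint32_t write_ptr;	/* next block to write; only moves forward */
	uint32_t live;		/* blocks still referenced by files */
};

/* Append one block; returns its offset in the group, or -1 if full. */
static int64_t zone_append(struct zone_model *z)
{
	if (z->write_ptr == z->capacity)
		return -1;
	z->live++;
	return z->write_ptr++;
}

/* An out-of-place overwrite makes one block in this group stale. */
static void zone_invalidate(struct zone_model *z)
{
	assert(z->live > 0);
	z->live--;
}

/*
 * Reclaim a group: move every still-valid block to dst by appending,
 * then rewind the source so it can be written from the start again.
 * Returns the number of blocks moved, or -1 if dst lacks room.
 */
static int64_t zone_reclaim(struct zone_model *src, struct zone_model *dst)
{
	uint32_t moved = src->live;

	if (dst->capacity - dst->write_ptr < moved)
		return -1;
	while (src->live > 0) {
		zone_append(dst);
		src->live--;
	}
	src->write_ptr = 0;
	return moved;
}
```

In the model, a block is never rewritten in place: overwrites append elsewhere
and invalidate the old copy, and a group only becomes writable again after its
remaining valid blocks have been moved out.  Finding the owners of those blocks
is the step for which the filesystem needs the reverse mapping; the model skips
it.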

For storage hardware that supports hardware zones, each rtgroup is mapped
to exactly one zone.  When a file system is created on a zoned storage
device that supports conventional (aka random writable) zones at the
beginning of the LBA space, those zones are used for the xfs data device
(which in this case is primarily used for metadata), and the zones requiring
sequential writes are presented as the real-time device.  When an external
real-time device is used, rtgroups might also map to conventional zones.

Filesystems with a zoned real-time device by default use the real-time device
for all data, and the data device only for metadata, which makes the
terminology a bit confusing.  But this is merely the default setting.  Like
any other filesystem with a realtime volume, the +XFS_DIFLAG_REALTIME+ flag
can be cleared on an empty regular file to target the data device; and the
+XFS_DIFLAG_RTINHERIT+ flag can be cleared on a directory so that new
children will target the data device.
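
The effect of these two flags can be sketched as follows.  The bit values are
assumed to correspond to bit 0 (+XFS_DIFLAG_REALTIME+) and bit 8
(+XFS_DIFLAG_RTINHERIT+) of the on-disk inode flags field.  The helper names
are hypothetical and the inheritance rule is a simplification.

```c
#include <assert.h>
#include <stdint.h>

#define DIFLAG_REALTIME		(1u << 0)	/* assumed XFS_DIFLAG_REALTIME */
#define DIFLAG_RTINHERIT	(1u << 8)	/* assumed XFS_DIFLAG_RTINHERIT */

enum target_dev { DEV_DATA, DEV_REALTIME };

/* Which device holds the data of a file with these inode flags. */
static enum target_dev file_target(uint16_t di_flags)
{
	return (di_flags & DIFLAG_REALTIME) ? DEV_REALTIME : DEV_DATA;
}

/*
 * Flags a new child starts with: a directory passes RTINHERIT down to
 * subdirectories and turns it into REALTIME for regular files.
 */
static uint16_t child_flags(uint16_t parent_flags, int child_is_dir)
{
	if (!(parent_flags & DIFLAG_RTINHERIT))
		return 0;
	return child_is_dir ? DIFLAG_RTINHERIT : DIFLAG_REALTIME;
}
```

In this model, clearing the inherit bit on a directory is enough to make every
file created in it afterwards target the data device, while existing files keep
whatever flag they already carry.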