[[On-disk_Inode]] = On-disk Inode All files, directories, and links are stored on disk with inodes and descend from the root inode with its number defined in the xref:Superblocks[superblock]. The previous section on xref:AG_Inode_Management[AG Inode Management] describes the allocation and management of inodes on disk. This section describes the contents of inodes themselves. An inode is divided into 3 parts: .On-disk inode sections image::images/23.png[] * The core contains what the inode represents, stat data, and information describing the data and attribute forks. * The +di_u+ ``data fork'' contains normal data related to the inode. Its contents depends on the file type specified by +di_core.di_mode+ (eg. regular file, directory, link, etc) and how much information is contained in the file which determined by +di_core.di_format+. The following union to represent this data is declared as follows: [source, c] ---- union { xfs_bmdr_block_t di_bmbt; xfs_bmbt_rec_t di_bmx[1]; xfs_dir2_sf_t di_dir2sf; char di_c[1]; xfs_dev_t di_dev; uuid_t di_muuid; char di_symlink[1]; } di_u; ---- * The +di_a+ ``attribute fork'' contains extended attributes. Its layout is determined by the +di_core.di_aformat+ value. Its representation is declared as follows: [source, c] ---- union { xfs_bmdr_block_t di_abmbt; xfs_bmbt_rec_t di_abmx[1]; xfs_attr_shortform_t di_attrsf; } di_a; ---- [NOTE] The above two unions are rarely used in the XFS code, but the structures within the union are directly cast depending on the +di_mode/di_format+ and +di_aformat+ values. They are referenced in this document to make it easier to explain the various structures in use within the inode. The remaining space in the inode after +di_next_unlinked+ where the two forks are located is called the inode's ``literal area''. This starts at offset 100 (0x64) in a version 1 or 2 inode, and offset 176 (0xb0) in a version 3 inode. The space for each of the two forks in the literal area is determined by the inode size, and +di_core.di_forkoff+. The data fork is located between the start of the literal area and +di_forkoff+. The attribute fork is located between +di_forkoff+ and the end of the inode. [[Inode_Core]] == Inode Core The inode's core is 96 bytes on a V4 filesystem and 176 bytes on a V5 filesystem. It contains information about the file itself including most stat data information about data and attribute forks after the core within the inode. It uses the following structure: [source, c] ---- struct xfs_dinode_core { __uint16_t di_magic; __uint16_t di_mode; __int8_t di_version; __int8_t di_format; union { __uint16_t di_onlink; __uint16_t di_metatype; }; __uint32_t di_uid; __uint32_t di_gid; __uint32_t di_nlink; __uint16_t di_projid; __uint16_t di_projid_hi; union { /* Number of data fork extents if NREXT64 is set */ __be64 di_big_nextents; /* Padding for V3 inodes without NREXT64 set. */ __be64 di_v3_pad; /* Padding and inode flush counter for V2 inodes. */ struct { __u8 di_v2_pad[6]; __be16 di_flushiter; }; }; xfs_timestamp_t di_atime; xfs_timestamp_t di_mtime; xfs_timestamp_t di_ctime; xfs_fsize_t di_size; xfs_rfsblock_t di_nblocks; xfs_extlen_t di_extsize; union { /* * For V2 inodes and V3 inodes without NREXT64 set, this * is the number of data and attr fork extents. */ struct { __be32 di_nextents; __be16 di_anextents; } __packed; /* Number of attr fork extents if NREXT64 is set. */ struct { __be32 di_big_anextents; __be16 di_nrext64_pad; } __packed; } __packed; xfs_extnum_t di_nextents; xfs_aextnum_t di_anextents; __uint8_t di_forkoff; __int8_t di_aformat; __uint32_t di_dmevmask; __uint16_t di_dmstate; __uint16_t di_flags; __uint32_t di_gen; /* di_next_unlinked is the only non-core field in the old dinode */ __be32 di_next_unlinked; /* version 5 filesystem (inode version 3) fields start here */ __le32 di_crc; __be64 di_changecount; __be64 di_lsn; __be64 di_flags2; union { __be32 di_cowextsize; __be32 di_used_blocks; }; __u8 di_pad2[12]; xfs_timestamp_t di_crtime; __be64 di_ino; uuid_t di_uuid; }; ---- *di_magic*:: The inode signature; these two bytes are ``IN'' (0x494e). *di_mode*:: Specifies the mode access bits and type of file using the standard S_Ixxx values defined in stat.h. *di_version*:: Specifies the inode version which currently can only be 1, 2, or 3. The inode version specifies the usage of the +di_onlink+, +di_nlink+ and +di_projid+ values in the inode core. Initially, inodes are created as v1 but can be converted on the fly to v2 when required. v3 inodes are created only for v5 filesystems. *di_format*:: Specifies the format of the data fork in conjunction with the +di_mode+ type. This can be one of several values. For directories and links, it can be ``local'' where all metadata associated with the file is within the inode; ``extents'' where the inode contains an array of extents to other filesystem blocks which contain the associated metadata or data; or ``btree'' where the inode contains a B+tree root node which points to filesystem blocks containing the metadata or data. Migration between the formats depends on the amount of metadata associated with the inode. ``dev'' is used for character and block devices while ``uuid'' is currently not used. ``rmap'' indicates that a reverse-mapping B+tree is rooted in the fork. [source, c] ---- typedef enum xfs_dinode_fmt { XFS_DINODE_FMT_DEV, XFS_DINODE_FMT_LOCAL, XFS_DINODE_FMT_EXTENTS, XFS_DINODE_FMT_BTREE, XFS_DINODE_FMT_UUID, XFS_DINODE_FMT_RMAP, XFS_DINODE_FMT_REFCOUNT } xfs_dinode_fmt_t; ---- *di_onlink*:: In v1 inodes, this specifies the number of links to the inode from directories. When the number exceeds 65535, the inode is converted to v2 and the link count is stored in +di_nlink+. *di_metatype*:: If the +XFS_SB_FEAT_INCOMPAT_METADIR+ feature is enabled, the +di_onlink+ field is redefined to declare the intended contents of files in the metadata directory tree. [source, c] ---- enum xfs_metafile_type { XFS_METAFILE_USRQUOTA, XFS_METAFILE_GRPQUOTA, XFS_METAFILE_PRJQUOTA, XFS_METAFILE_RTBITMAP, XFS_METAFILE_RTSUMMARY, XFS_METAFILE_RTRMAP, XFS_METAFILE_RTREFCOUNT, }; ---- *di_uid*:: Specifies the owner's UID of the inode. *di_gid*:: Specifies the owner's GID of the inode. *di_nlink*:: Specifies the number of links to the inode from directories. This is maintained for both inode versions for current versions of XFS. Prior to v2 inodes, this field was part of +di_pad+. *di_projid*:: Specifies the owner's project ID in v2 inodes. An inode is converted to v2 if the project ID is set. This value must be zero for v1 inodes. *di_projid_hi*:: Specifies the high 16 bits of the owner's project ID in v2 inodes, if the +XFS_SB_VERSION2_PROJID32BIT+ feature is set; and zero otherwise. *di_pad[6]*:: Reserved, must be zero. Only exists for v2 inodes. *di_flushiter*:: Incremented on flush. Only exists for v2 inodes. *di_v3_pad*:: Must be zero for v3 inodes without the NREXT64 flag set. *di_big_nextents*:: Specifies the number of data extents associated with this inode if the NREXT64 flag is set. This allows for up to 2^48^ - 1 extent mappings. *di_atime*:: Specifies the last access time of the files using UNIX time conventions the following structure. This value may be undefined if the filesystem is mounted with the ``noatime'' option. XFS supports timestamps with nanosecond resolution: [source, c] ---- struct xfs_timestamp { __int32_t t_sec; __int32_t t_nsec; }; ---- If the +XFS_SB_FEAT_INCOMPAT_BIGTIME+ feature is enabled, the 64 bits used by the timestamp field are interpreted as a flat 64-bit nanosecond counter. See the section about xref:Inode_Timestamps[inode timestamps] for more details. *di_mtime*:: Specifies the last time the file was modified. *di_ctime*:: Specifies when the inode's status was last changed. *di_size*:: Specifies the EOF of the inode in bytes. This can be larger or smaller than the extent space (therefore actual disk space) used for the inode. For regular files, this is the filesize in bytes, directories, the space taken by directory entries and for links, the length of the symlink. *di_nblocks*:: Specifies the number of filesystem blocks used to store the inode's data including relevant metadata like B+trees. This does not include blocks used for extended attributes. *di_extsize*:: Specifies the extent size for filesystems with real-time devices or an extent size hint for standard filesystems. For normal filesystems, and with directories, the +XFS_DIFLAG_EXTSZINHERIT+ flag must be set in +di_flags+ if this field is used. Inodes created in these directories will inherit the di_extsize value and have +XFS_DIFLAG_EXTSIZE+ set in their +di_flags+. When a file is written to beyond allocated space, XFS will attempt to allocate additional disk space based on this value. *di_nextents*:: Specifies the number of data extents associated with this inode if the NREXT64 flag is not set. Supports up to 2^31^ - 1 extents. *di_anextents*:: Specifies the number of extended attribute extents associated with this inode if the NREXT64 flag is not set. Supports up to 2^15^ - 1 extents. *di_big_anextents*:: Specifies the number of extended attribute extents associated with this inode if the NREXT64 flag is set. Supports up to 2^32^ - 1 extents. *di_nrext64_pad*:: Must be zero if the NREXT64 flag is set. *di_forkoff*:: Specifies the offset into the inode's literal area where the extended attribute fork starts. This is an 8-bit value that is multiplied by 8 to determine the actual offset in bytes (ie. attribute data is 64-bit aligned). This also limits the maximum size of the inode to 2048 bytes. This value is initially zero until an extended attribute is created. When in attribute is added, the nature of +di_forkoff+ depends on the +XFS_SB_VERSION2_ATTR2BIT+  flag in the superblock. Refer to xref:Extended_Attribute_Versions[Extended Attribute Versions] for more details. *di_aformat*:: Specifies the format of the attribute fork. This uses the same values as +di_format+, but restricted to ``local'', ``extents'' and ``btree'' formats for extended attribute data. *di_dmevmask*:: DMAPI event mask. *di_dmstate*:: DMAPI state. *di_flags*:: Specifies flags associated with the inode. This can be a combination of the following values: .Version 2 Inode flags [options="header"] |===== | Flag | Description | +XFS_DIFLAG_REALTIME+ | The inode's data is located on the real-time device. | +XFS_DIFLAG_PREALLOC+ | The inode's extents have been preallocated. | +XFS_DIFLAG_NEWRTBM+ | Specifies the +sb_rbmino+ uses the new real-time bitmap format | +XFS_DIFLAG_IMMUTABLE+ | Specifies the inode cannot be modified. | +XFS_DIFLAG_APPEND+ | The inode is in append only mode. | +XFS_DIFLAG_SYNC+ | The inode is written synchronously. | +XFS_DIFLAG_NOATIME+ | The inode's +di_atime+ is not updated. | +XFS_DIFLAG_NODUMP+ | Specifies the inode is to be ignored by xfsdump. | +XFS_DIFLAG_RTINHERIT+ | For directory inodes, new inodes inherit the +XFS_DIFLAG_REALTIME+ bit. | +XFS_DIFLAG_PROJINHERIT+ | For directory inodes, new inodes inherit the +di_projid+ value. | +XFS_DIFLAG_NOSYMLINKS+ | For directory inodes, symlinks cannot be created. | +XFS_DIFLAG_EXTSIZE+ | Specifies the extent size for real-time files or an extent size hint for regular files. | +XFS_DIFLAG_EXTSZINHERIT+ | For directory inodes, new inodes inherit the +di_extsize+ value. | +XFS_DIFLAG_NODEFRAG+ | Specifies the inode is to be ignored when defragmenting the filesystem. | +XFS_DIFLAG_FILESTREAMS+ | Use the filestream allocator. The filestreams allocator allows a directory to reserve an entire allocation group for exclusive use by files created in that directory. Files in other directories cannot use AGs reserved by other directories. |===== *di_gen*:: A generation number used for inode identification. This is used by tools that do inode scanning such as backup tools and xfsdump. An inode's generation number can change by unlinking and creating a new file that reuses the inode. *di_next_unlinked*:: See the section on xref:Unlinked_Pointer[unlinked inode pointers] for more information. *di_crc*:: Checksum of the inode. *di_changecount*:: Counts the number of changes made to the attributes in this inode. *di_lsn*:: Log sequence number of the last inode write. *di_flags2*:: Specifies extended flags associated with a v3 inode. .Version 3 Inode flags [options="header"] |===== | Flag | Description | +XFS_DIFLAG2_DAX+ | For a file, enable DAX to increase performance on persistent-memory storage. If set on a directory, files created in the directory will inherit this flag. | +XFS_DIFLAG2_REFLINK+ | This inode shares (or has shared) data blocks with another inode. | +XFS_DIFLAG2_COWEXTSIZE+ | For files, this is the extent size hint for copy on write operations; see +di_cowextsize+ for details. For directories, the value in +di_cowextsize+ will be copied to all newly created files and directories. | +XFS_DIFLAG2_NREXT64+ | Files with this flag set may have up to (2^48^ - 1) extents mapped to the data fork and up to (2^32^ - 1) extents mapped to the attribute fork. This flag requires the +XFS_SB_FEAT_INCOMPAT_NREXT64+ feature to be enabled. | +XFS_DIFLAG2_METADATA+ | This file contains filesystem metadata. This feature requires the +XFS_SB_FEAT_INCOMPAT_METADIR+ feature to be enabled. See the section about xref:Metadata_Directories[metadata directories] for more information on metadata inode properties. Only directories and regular files can have this flag set. |===== *di_cowextsize*:: Specifies the extent size hint for copy on write operations. When allocating extents for a copy on write operation, the allocator will be asked to align its allocations to either +di_cowextsize+ blocks or +di_extsize+ blocks, whichever is greater. The +XFS_DIFLAG2_COWEXTSIZE+ flag must be set if this field is used. If this field and its flag are set on a directory file, the value will be copied into any files or directories created within this directory. During a block sharing operation, this value will be copied from the source file to the destination file if the sharing operation completely overwrites the destination file's contents and the destination file does not already have +di_cowextsize+ set. *di_used_blocks*:: Used only for the xref:Real_time_Reverse_Mapping_Btree[Reverse-Mapping B+tree] inode on filesystems with a xref:Zoned[Zoned Real-time Device]. Tracks the number of filesystem blocks in the rtgroup that have been written but not unmapped, i.e. the number of blocks that are referenced by at least one rmap entry. *di_pad2*:: Padding for future expansion of the inode. *di_crtime*:: Specifies the time when this inode was created. *di_ino*:: The full inode number of this inode. *di_uuid*:: The UUID of this inode, which must match either +sb_uuid+ or +sb_meta_uuid+ depending on which features are set. [[Unlinked_Pointer]] == Unlinked Pointer The +di_next_unlinked+ value in the inode is used to track inodes that have been unlinked (deleted) but are still open by a program. When an inode is in this state, the inode is added to one of the xref:AG_Inode_Management[AGI's] +agi_unlinked+ hash buckets. The AGI unlinked bucket points to an inode and the +di_next_unlinked+ value points to the next inode in the chain. The last inode in the chain has +di_next_unlinked+ set to NULL (-1). Once the last reference is released, the inode is removed from the unlinked hash chain and +di_next_unlinked+ is set to NULL. In the case of a system crash, XFS recovery will complete the unlink process for any inodes found in these lists. The only time the unlinked fields can be seen to be used on disk is either on an active filesystem or a crashed system. A cleanly unmounted or recovered filesystem will not have any inodes in these unlink hash chains. .Unlinked inode pointer image::images/28.png[] [[Data_Fork]] == Data Fork The structure of the inode's data fork based is on the inode's type and +di_format+. The data fork begins at the start of the inode's ``literal area''. This area starts at offset 100 (0x64), or offset 176 (0xb0) in a v3 inode. The size of the data fork is determined by the type and format. The maximum size is determined by the inode size and +di_forkoff+. In code, use the +XFS_DFORK_PTR+ macro specifying +XFS_DATA_FORK+ for the ``which'' parameter. Alternatively, the +XFS_DFORK_DPTR+ macro can be used. Each of the following sub-sections summarises the contents of the data fork based on the inode type. [[Regular_Files_S_IFREG]] === Regular Files (S_IFREG) The data fork specifies the file's data extents. The extents specify where the file's actual data is located within the filesystem. Extents can have 2 formats which is defined by the di_format value: * +XFS_DINODE_FMT_EXTENTS+: The extent data is fully contained within the inode which contains an array of extents to the filesystem blocks for the file's data. To access the extents, cast the return value from +XFS_DFORK_DPTR+ to +xfs_bmbt_rec_t*+. * +XFS_DINODE_FMT_BTREE+: The extent data is contained in the leaves of a B+tree. The inode contains the root node of the tree and is accessed by casting the return value from +XFS_DFORK_DPTR+ to +xfs_bmdr_block_t*+. Details for each of these data extent formats are covered in the xref:Data_Extents[Data Extents] later on. [[Directories_S_IFDIR]] === Directories (S_IFDIR) The data fork contains the directory's entries and associated data. The format of the entries is also determined by the +di_format+ value and can be one of 3 formats: * +XFS_DINODE_FMT_LOCAL+: The directory entries are fully contained within the inode. This is accessed by casting the value from +XFS_DFORK_DPTR+ to +xfs_dir2_sf_t*+. * +XFS_DINODE_FMT_EXTENTS+: The actual directory entries are located in another filesystem block, the inode contains an array of extents to these filesystem blocks (+xfs_bmbt_rec_t*+). * +XFS_DINODE_FMT_BTREE+: The directory entries are contained in the leaves of a B+tree. The inode contains the root node (+xfs_bmdr_block_t*+). Details for each of these directory formats are covered in the xref:Directories[Directories] later on. [[Symbolic_Links_S_IFLNK]] === Symbolic Links (S_IFLNK) The data fork contains the contents of the symbolic link. The format of the link is determined by the +di_format+ value and can be one of 2 formats: * +XFS_DINODE_FMT_LOCAL+: The symbolic link is fully contained within the inode. This is accessed by casting the return value from +XFS_DFORK_DPTR+ to +char*+. * +XFS_DINODE_FMT_EXTENTS+: The actual symlink is located in another filesystem block, the inode contains the extents to these filesystem blocks (+xfs_bmbt_rec_t*+). Details for symbolic links is covered in the section about xref:Symbolic_Links[Symbolic Links]. [[Other_File_Types]] === Other File Types For character and block devices (+S_IFCHR+ and +S_IFBLK+), cast the value from +XFS_DFORK_DPTR+ to +xfs_dev_t*+. [[Attribute_Fork]] == Attribute Fork The attribute fork in the inode always contains the location of the extended attributes associated with the inode. The location of the attribute fork in the inode's literal area is specified by the +di_forkoff+ value in the inode's core. If this value is zero, the inode does not contain any extended attributes. If non-zero, the attribute fork's byte offset into the literal area can be computed from +di_forkoff × 8+. Attributes must be allocated on a 64-bit boundary on the disk. To access the extended attributes in code, use the +XFS_DFORK_PTR+ macro specifying +XFS_ATTR_FORK+ for the ``which'' parameter. Alternatively, the +XFS_DFORK_APTR+ macro can be used. The structure of the attribute fork depends on the +di_aformat+ value in the inode. It can be one of the following values: * +XFS_DINODE_FMT_LOCAL+: The extended attributes are contained entirely within the inode. This is accessed by casting the value from +XFS_DFORK_APTR+ to +xfs_attr_shortform_t*+. * +XFS_DINODE_FMT_EXTENTS+: The attributes are located in another filesystem block, the inode contains an array of pointers to these filesystem blocks. They are accessed by casting the value from +XFS_DFORK_APTR+ to +xfs_bmbt_rec_t*+. * +XFS_DINODE_FMT_BTREE+: The extents for the attributes are contained in the leaves of a B+tree. The inode contains the root node of the tree and is accessed by casting the value from +XFS_DFORK_APTR+ to +xfs_bmdr_block_t*+. Detailed information on the layouts of extended attributes are covered in the xref:Extended_Attributes[Extended Attributes] in this document. [[Extended_Attribute_Versions]] === Extended Attribute Versions Extended attributes come in two versions: ``attr1'' or ``attr2''. The attribute version is specified by the +XFS_SB_VERSION2_ATTR2BIT+  flag in the +sb_features2+ field in the superblock. It determines how the inode's extra space is split between +di_u+ and +di_a+ forks which also determines how the +di_forkoff+ value is maintained in the inode's core. With ``attr1'' attributes, the +di_forkoff+ is set to somewhere in the middle of the space between the core and end of the inode and never changes (which has the effect of artificially limiting the space for data information). As the data fork grows, when it gets to +di_forkoff+, it will move the data to the next format level (ie. local < extent < btree). If very little space is used for either attributes or data, then a good portion of the available inode space is wasted with this version. ``attr2'' was introduced to maximum the utilisation of the inode's literal area. The +di_forkoff+ starts at the end of the inode and works its way to the data fork as attributes are added. Attr2 is highly recommended if extended attributes are used. The following diagram compares the two versions: .Extended attribute layouts image::images/30.png[] Note that because +di_forkoff+ is an 8-bit value measuring units of 8 bytes, the maximum size of an inode is 2^8^ × 2^3^ = 2^11^ = 2048 bytes.