LVM, LUKS2 & BTRFS Problem

Question

I had the following setup: 3 HDDs 10Tb of size, in LVM Raid5 configuration On top a LUKS2 encryption and inside a BTRFS filesystem.

Since my storage got low i added another 16TB HDD (was cheaper than 10TB) added it as physcial volume in LVM, added it to the volume group, ran a resync, so that LVM can adjust the size of my RAID. I resized the btrfs partition to max.

I noticed, that in dmesg errors began to appear shortly after the btrfs resize when i write to it:

[53034.840728] btrfs_dev_stat_print_on_error: 299 callbacks suppressed
[53034.840731] BTRFS error (device dm-15): bdev /dev/mapper/data errs: wr 807, rd 0, flush 0, corrupt 0, gen 0
[53034.841289] BTRFS error (device dm-15): bdev /dev/mapper/data errs: wr 808, rd 0, flush 0, corrupt 0, gen 0
[53034.844993] BTRFS error (device dm-15): bdev /dev/mapper/data errs: wr 809, rd 0, flush 0, corrupt 0, gen 0
[53034.845893] BTRFS error (device dm-15): bdev /dev/mapper/data errs: wr 810, rd 0, flush 0, corrupt 0, gen 0
[53034.846154] BTRFS error (device dm-15): bdev /dev/mapper/data errs: wr 811, rd 0, flush 0, corrupt 0, gen 0

I can exclude hardware problems, since i tried that on another computer in a virtual machine. The problems in dmesg do appear when i write bigger files (400Mb) to the filesystem, but not something like a text file - the checksum is also wrong after a copy from one file of the raid to another:

gallifrey raid5 # dd if=/dev/urandom of=original.img bs=40M count=100
0+100 records in
0+100 records out
3355443100 bytes (3.4 GB, 3.1 GiB) copied, 54.0163 s, 62.1 MB/s
gallifrey raid5 # cp original.img copy.img
gallifrey raid5 # md5sum original.img copy.img            
29867131c09cc5a6e8958b2eba5db4c9  original.img
59511b99494dd4f7cf1432b19f4548c4  copy.img

gallifrey raid5 # btrfs device stats /mnt/raid5 
[/dev/mapper/data].write_io_errs    811
[/dev/mapper/data].read_io_errs     0
[/dev/mapper/data].flush_io_errs    0
[/dev/mapper/data].corruption_errs  0
[/dev/mapper/data].generation_errs  0

I already resynced the entire lvm raid, did a smartctl checkup multiple times (shouldn't be a hw problem, but still) and did btrfs scrub start -B /mnt/raid5 and btrfs check -p --force /dev/mapper/data while non of them returned any error whatsoever.

Hapened on kernel 5.15.11 and 5.10.27

lvm version:

gallifrey raid5 # lvm version  
  LVM version:     2.02.188(2) (2021-05-07)
  Library version: 1.02.172 (2021-05-07)
  Driver version:  4.45.0

My goal is that future writes to the drive are non-corrupted, while the already corrupted files can be deleted, but the good files I would like to save or at least not delete.

From the man page of btrfs it says, that write_io_errs means that the block device beneath does not succeed in writing. In my case that means, that lvm and or luks2 is the problem here.

Any suggestions, or any more information needed?

Cheers

Gallifrey · Accepted Answer · 2022-01-12 16:45:25Z

I could not find the root cause of this problem, so I decided to ditch LVM in it's entirety and replace it with mdadm - which worked like charm on the first try.

Creating mdadm RAID5 (initially with 3 disks)

Creating with three disks (henceworth raid-devices = 3):

mdadm --create mediaraid --level=raid5 --raid-devices=3 /dev/sda /dev/sdb /dev/sde

Optionally checking what encryption you can use at what speed (memory speed, not disk IO):

cryptsetup benchmark /dev/md/mediaraid

Optionally encrypting the entire RAID (a construct like this does not require to decrypt every disk on its own. One password for the ENTIRE RAID):

cryptsetup luksFormat --hash sha512 --cipher aes-xts-plain64 --key-size 512 /dev/md/mediaraid

Opening the LUKS device (necessary for formatting it):

cryptsetup luksOpen /dev/md/mediaraid

Format the RAID with btrfs:

mkfs.btrfs /dev/mapper/data -f

Growing/Expanding a btrfs filesystem by 1 disk and an underlying mdadm RAID5

Preconditions: Filesystem is not mounted and LUKS device is closed:

umount /mnt/raid5 && cryptsetup close /dev/mapper/data

Adding /dev/sdc (replace with your drive) to mdadm as a spare disk:

mdadm --add /dev/md/mediaraid /dev/sdc

Verify it shows up (will be at the bottom, saying it is a spare disk):

mdadm --detail /dev/md/mediaraid

Note: The following step triggers a RAID reshape, things are getting real, my 10TB hard drives took about 25-30 hours to reshape and sync from 3 to 4 disks. I am not sure if a reboot is safe during the respahe - but I wouldn't recommend it or at least try it in a virtual machine.

Grow the RAID to the number of disks (most of the time you want to write the total count of disks here, 3 + 1 = 4, now I have 4 drives available and I want to use ALL 4 of them):

mdadm --grow --raid-devices=4 /dev/md/mediaraid

Monitor progress of reshape (first one is better):

cat /proc/mdstat or mdadm --detail /dev/md/mediaraid

After it is done reshaping:

Optionally, if you use LUKS Decrypt the RAID - else continue with the next step:

cryptsetup luksOpen /dev/md/mediaraid data

Mount the btrfs filesystem:

mount /dev/mapper/data /mnt/raid5

Grow the btrfs filesystem to max or whatever you want:

btrfs filesystem resize max /mnt/raid5

It might be not necessary but I unmounted and remounted the entire thing after the btrfs filesystem resize

umount /mnt/raid5 && mount /dev/maper/data /mnt/raid5

Done.

gapsf · Accepted Answer · 2022-09-26 15:52:23Z

1

Warning about raid5+dm-crypt+btrfs.

mdraid is much better than lvm raid. But raid5 is worst raid level.

Raid5 is horrible for write performance and for robustness. In degraded mode even worse.

btrfs is redundant by itself so btrfs on top of raid5 is wired usage.

But also btrfs is brittle and not robust, have wired redundancy behavior.

So raid5+dm-crypt+btrfs is more horrible setup - you have worst write performance and eventually lost you data.

When one disk failed syncing 10tb will go to infinity on raid5. Or when os crashes due bug or power failure btrfs get corrupted by itself or due dm-crypt, raid5 corruptions (os page cache, disks hardware cache, raid write hole)

Whats wrong with btrfs

https://arstechnica.com/gadgets/2021/09/examining-btrfs-linuxs-perpetually-half-finished-filesystem/

edited Sep 26, 2022 at 15:52

answered Sep 26, 2022 at 14:27

gapsf

6244 silver badges7 bronze badges

Thanks for your comment. The reason I have RAID5 is, it is cheap and I can lose 1 hard drive. The slow write speed is not great, but I don't want to mirror every disk. For the issue with BTRFS, I can say I had multiple blackouts over the year with a running system and did not lose any data and generally had no issues with it. Do you have a better solution (which is also cheap)?

Gallifrey
– Gallifrey

2022-09-27 10:50:47 +00:00
Commented Sep 27, 2022 at 10:50
There is still no much choice today: use disks directly without any raid. 4 disks - mdraid10 instead of mdraid5 or btrfs-raid10 directly on disks. 3 disks - zfs with raidz pool directly on disks without lvm and raid.

gapsf
– gapsf

2022-09-27 11:55:03 +00:00
Commented Sep 27, 2022 at 11:55
Raid10 costs to much. BTRFS Raid is generally a bad idea, at least many sources say so. Can I expand a zfs pool with more disks as time goes on AND encrypt my data with LUKS?

Gallifrey
– Gallifrey

2022-09-28 12:04:53 +00:00
Commented Sep 28, 2022 at 12:04
zfs topology arstechnica.com/information-technology/2020/05/…

gapsf
– gapsf

2022-09-28 14:24:22 +00:00
Commented Sep 28, 2022 at 14:24
Expending raidz vdev still not implemenyed so pool grow only by adding new raidz vdev arstechnica.com/gadgets/2021/06/…

gapsf
– gapsf

2022-09-28 14:27:57 +00:00
Commented Sep 28, 2022 at 14:27

| Show 1 more comment

calestyo · Accepted Answer · 2022-09-26 03:10:17Z

0

Could your problem have been this bug in btrfs-progs?

https://lore.kernel.org/linux-btrfs/[email protected]/

answered Sep 26, 2022 at 3:10

calestyo

1451 silver badge5 bronze badges

This does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. - From Review

dr_
– dr_

2022-09-26 09:12:49 +00:00
Commented Sep 26, 2022 at 9:12

Add a comment |

Stack Exchange Network

LVM, LUKS2 & BTRFS Problem

3 Answers 3

Creating mdadm RAID5 (initially with 3 disks)

Growing/Expanding a btrfs filesystem by 1 disk and an underlying mdadm RAID5

You must log in to answer this question.

Hot Network Questions

LVM, LUKS2 & BTRFS Problem

3 Answers 3

Creating mdadm RAID5 (initially with 3 disks)

Growing/Expanding a btrfs filesystem by 1 disk and an underlying mdadm RAID5

You must log in to answer this question.

Related

Hot Network Questions