Thursday, August 7, 2014

Some notes about testing network block devices

I tested another random thing: network boot options short of iSCSI. The first thing I tried was mounting the root filesystem from the startup environment over a network block device. That was extremely simple and worked fine. There is a likely shutdown glitch, but if the root filesystem is synced the effect should be relatively minor. I am using xfs, and it recovers from that kind of incomplete shutdown relatively easily; well, at least it did in my limited tests. Another interesting idea to test is whether it is somehow possible to mount the root filesystem using ssh+nbd. That introduces some extra complications for the root filesystem, but it could definitely be adapted to encrypt traffic for another mount point, such as /home. I guess this may become practical for those who have fiber optic connections. The technique I tried is similar to sshfs, but uses only the default tools.
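As a sketch of that simple, unencrypted case: assuming an nbd-server is already exporting the image on the LAN (the address, port, and mount point below are illustrative), the startup environment only needs something like this:

# run from the startup environment; server address and port are illustrative
modprobe nbd
nbd-client 192.168.1.10 2000 /dev/nbd0
mount -t xfs /dev/nbd0 /mnt
# then switch_root/pivot_root into /mnt as usual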

Here is a rough idea of how to do the ssh+nbd variant. First, log in to the computer hosting the image that is to be mounted somewhere in the filesystem. Start the network block device server so that it is accessible only locally. Add more local security here, if necessary.

ssh -K douglas.mayne@somecomputer
# attach the image to the first free loop device; confirm which device
# was chosen (losetup -a) before serving it below
losetup -f somefile.img
# export the loop device on the loopback interface only, port 2000;
# -C names an alternate config file, bypassing the default one
nbd-server 127.0.0.1@2000 -C /root/non /dev/loop1

Then set up the connection back on the computer that would like to use this network disk:

# load the nbd driver on the client
modprobe nbd
# tunnel local port 2500 to port 2000 on the server's loopback interface
ssh -K douglas.mayne@somecomputer -n -L 2500:localhost:2000
# attach the tunneled export to /dev/nbd0, then mount it
nbd-client localhost 2500 /dev/nbd0
mount /dev/nbd0 /home
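Teardown reverses the steps; nbd-client -d detaches the device:

umount /home
nbd-client -d /dev/nbd0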

For encrypting the root filesystem's traffic, I am wondering if a strategy based on pivot_root might work.

On Friday, I tested this a bit more. I had missed the obvious solution for encrypting the root filesystem's traffic over a network block device: run device mapper on top of the network block device. I tested this and it works in the standard way.
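For example, here is a minimal sketch using cryptsetup/LUKS on the client, assuming /dev/nbd0 is connected as above; the mapping name nbdcrypt is arbitrary. Because the encryption happens on the client side of the connection, the data blocks cross the wire as ciphertext:

cryptsetup luksFormat /dev/nbd0          # one time: create the LUKS container
cryptsetup luksOpen /dev/nbd0 nbdcrypt   # map it through device mapper
mkfs.xfs /dev/mapper/nbdcrypt            # one time: make the filesystem
mount /dev/mapper/nbdcrypt /home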

dm-live issue and fix. Also debugging with kexec

I noticed that booting to a root filesystem connected via USB was broken on certain hardware. I initially assumed that the failure might be a kernel bug, something specific to certain motherboard chipsets, or a flaw in my understanding of udev. I was stumped and invested some time in figuring out what was wrong. This screenshot shows a typical bootup failure. Googling the failure shows that others are experiencing the same trouble, perhaps for similar reasons; I didn't see any obvious solutions. Until now I have mostly been working around the problem by using a root filesystem connected directly over SATA.

I stumbled on the solution by chance:

Include the ehci-pci module among those loaded at boot. Around kernel 3.8, the EHCI driver's PCI support was split out into the separate ehci-pci module, so a boot environment that loads only ehci-hcd no longer binds USB 2.0 controllers. With ehci-pci included, the external drive, be it a flash disc or a magnetic disc, is detected and runs at full USB 2.0 speed.
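As a sketch, assuming Slackware's mkinitrd is used to build the boot image (the kernel version matches mine, but the devices and module list are illustrative):

# rebuild the initrd with ehci-pci in the module list;
# -w 10 waits for slow USB drives to settle before mounting root
mkinitrd -c -k 3.10.51 -f xfs -r /dev/sdb1 \
  -m ehci-pci:usb-storage:xfs -u -w 10 -o /boot/initrd.gz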

Because booting to a root filesystem on USB is the heart of a live USB install, it's good to know that it was a case of PEBKAC and not something more systemic.

With that fix out of the way, there are some other changes to my dm-live startup environment, including tracking the stable kernel releases in the 3.10.x series. Currently, I am testing with kernel 3.10.51.

By the way, in the debug phase of this problem, I was pointed to kdump. I played around with the kexec utility to see the basics of kernel debugging and how it is designed to work. I found right away that the kernel configuration I have been using, which is extremely similar to the default Slackware kernel, does not generate debugging symbols, does not enable boot-time reservation of memory via the crashkernel= parameter, and is not arbitrarily relocatable. Using kdump requires a few changes, including adding full debugging symbols to the compiled objects. In general, I agree with Slackware's decision not to include the symbols, because they increase the size of the resulting build by a factor of 5 or 6.

In the end, throughout my attempt to debug the above problem, I wasn't able to generate the right kind of crash, i.e. one that actually writes its crash data. I was pretty sure I was doing it right, because kexec -l and kexec -e were working as expected. The panic kernel loaded with kexec -p just wasn't tripped, for whatever reason. It was too hard of a crash, I guess.
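For reference, here is roughly what the kdump flow looks like, assuming a capture-capable kernel and a crashkernel= reservation on the command line (the paths and append line are illustrative):

# boot the running kernel with, e.g., crashkernel=128M on its command line
# load the capture kernel into the reserved region
kexec -p /boot/vmlinuz-3.10.51 --initrd=/boot/initrd.gz \
  --append="root=/dev/sda1 irqpoll maxcpus=1 reset_devices"
# force a test panic to verify the capture kernel actually boots
echo c > /proc/sysrq-trigger

If the capture kernel comes up, the old kernel's memory is exposed as /proc/vmcore, ready to be copied off with cp or makedumpfile.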