Nothing strikes fear into the heart of a system administrator or end user like logging into a system and finding their content GONE. As storage has shifted from hard drives to flash memory and file systems have become more reliable, loss of data from random failures has become far less likely. However, a variety of administrative mistakes -- beyond accidental erase / remove commands -- can APPEAR to result in missing data. Given the overall reliability of the underlying hardware, it is useful to understand the scenarios that can produce symptoms of data loss and to learn troubleshooting techniques that identify the root cause so the problem can be resolved.

This post illustrates an administrative mistake involving an installation of OpenSSL that broke key processes within a Linux system, causing all content of the system's /home directory to DISAPPEAR, along with other directories tied to remote NFS and Samba volumes. The problem took about 90 minutes to diagnose and ended with zero actual data loss.

The Original Administrative Mistake

This problem started after a new installation of OpenSSL was added to a Linux host running Fedora 43 so the OpenSSL libraries could be referenced when compiling / building a version of Python from source. To make the library directory of the OpenSSL build easy to reference in the Python build, OpenSSL was built from source and installed at /opt/openssl341. To ensure the OpenSSL shared modules were usable in the Python build, the /opt/openssl341/lib64 directory was added to the dynamic linker configuration file at /etc/ld.so.conf and the linker configuration reloaded by running /sbin/ldconfig to pick up the new library directory.

This allowed Python to be compiled but the Fedora host was not restarted after updating the dynamic linker. This deferred recognition of the fact that the new OpenSSL library modules were incompatible with various operating system components that used its libcrypto.so module.
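
One check that could be run right after ldconfig, without waiting for a reboot, is to see which files the linker cache now lists for a library name the OS depends on. The sketch below parses text in the format printed by `ldconfig -p`. The function names, the list of "system" directories and the sample cache text are assumptions made for illustration; the sample was NOT captured from the affected host.

```python
import re

# Matches entries in the format printed by `ldconfig -p`, for example:
#     libcrypto.so.3 (libc6,x86-64) => /lib64/libcrypto.so.3
CACHE_ENTRY = re.compile(r"^\s*(\S+)\s+\(([^)]*)\)\s+=>\s+(\S+)\s*$")

# Directories treated as "belonging to the OS" for this illustration only.
SYSTEM_DIRS = ("/lib/", "/lib64/", "/usr/lib/", "/usr/lib64/")

# Hypothetical cache listing, not real output from the host in this post.
SAMPLE_CACHE = (
    "2 libs found in cache `/etc/ld.so.cache'\n"
    "\tlibcrypto.so.3 (libc6,x86-64) => /opt/openssl341/lib64/libcrypto.so.3\n"
    "\tlibcrypto.so.3 (libc6,x86-64) => /lib64/libcrypto.so.3\n"
)


def cached_paths(cache_text, soname):
    """Return every path the cache listing shows for soname, in listing order."""
    paths = []
    for line in cache_text.splitlines():
        match = CACHE_ENTRY.match(line)
        if match and match.group(1) == soname:
            paths.append(match.group(3))
    return paths


def non_system_paths(paths):
    """Keep only the paths that live outside the assumed OS library directories."""
    return [p for p in paths if not p.startswith(SYSTEM_DIRS)]


for path in non_system_paths(cached_paths(SAMPLE_CACHE, "libcrypto.so.3")):
    print("libcrypto.so.3 is also cached outside the OS directories:", path)
```

On a live system the text would come from running `ldconfig -p` rather than from the sample string. Any hit outside the OS directories for a library like libcrypto is a hint that system services may pick up the new build.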

How The Mistake Broke the System Boot

Many different modules for handling user authorization, logging and file system management rely upon the libcrypto module to function. Under the covers, the SELinux (Security Enhanced Linux) layer detected differences between the new libcrypto.so module seen via the dynamic linker and the rest of the OpenSSL installation the OS was using and BLOCKED all access to the libcrypto.so module. That caused numerous processes to fail during startup. The kernel saw these failures and altered the boot configuration to mount the system's primary drive read-only instead of read-write.

Since the unexpected version of libcrypto.so broke the User Database Service (systemd-userdbd.service), higher level operating system functions requiring that layer to be functioning to control access to other processes and resources failed and could not perform required functions. One of those functions involves displaying entries in the filesystem. As a result, attempting to list any directories within /home which are owned by non-root users didn't just return the directory information with weird integer ID values for the owner and group, such listings returned NOTHING. This gave the impression the content was GONE, rather than merely unreadable or unwriteable.
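
When name lookups are suspect, it can help to inspect a directory using numeric IDs only, so the check depends as little as possible on the user database. The following is a minimal sketch of that idea, not something that was run during this incident. Whether it would have shown the /home entries on the broken host is unknown. The function name is hypothetical.

```python
import os
import pwd


def list_with_numeric_owners(directory):
    """Return (name, uid, gid, owner_name) per entry; owner_name is None
    when the user database cannot translate the uid."""
    rows = []
    with os.scandir(directory) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            info = entry.stat(follow_symlinks=False)
            try:
                owner = pwd.getpwuid(info.st_uid).pw_name
            except (KeyError, OSError):
                # Lookup failed: keep the entry and report numeric IDs only.
                owner = None
            rows.append((entry.name, info.st_uid, info.st_gid, owner))
    return rows


for name, uid, gid, owner in list_with_numeric_owners("/"):
    print(f"{uid:>6} {gid:>6} {owner or '?':<12} {name}")
```

If entries show up here with numeric IDs but no owner name, the data is present and the problem lies in the lookup layer rather than on the disk.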

Of course, enforcement of user-based security is also crucial to creating links to external file systems using NFS and Samba. As a result, the directory /gitrepo linked to a remote NFS storage volume on a TrueNAS server was not connected, making it appear that all local git repository content had been lost. Another directory /smb used to back up non-source related content was also unable to connect, making THAT content appear to have disappeared as well.

Luckily, the larger environment was configured with other Fedora and Windows systems with similar connections to those remote NFS and Samba volumes and all of those connections worked, proving the content was present and had not been lost. That made it easier to focus on finding a cause that could be corrected without loss of data.

Finding the Underlying Fault

The first problem that became apparent with the failed boot is that some of the first common sources of diagnostics such as /var/log/warn or /var/log/last that identify events in the most recent boot had no new content. They couldn't because the entire machine volume had been mounted read-only. Instead, the journalctl command provided similar details and quickly pointed to openssl being involved.

The first set of log messages that pointed out a problem involved HUNDREDS of these messages that were generated around the time the new version of OpenSSL was first installed and the dynamic linker configuration updated the day before.

Apr 23 15:00:20 fedora1 systemd-userdbd[82099]: /usr/lib/systemd/systemd-userwork: error while loading shared libraries: libcrypto.so.3: failed to map segment from sha>
Apr 23 15:00:20 fedora1 systemd-userdbd[577]: Worker 82099 died with a failure exit status 127, ignoring.
Apr 23 15:00:20 fedora1 systemd-userdbd[82100]: /usr/lib/systemd/systemd-userwork: error while loading shared libraries: libcrypto.so.3: failed to map segment from sha>
Apr 23 15:00:20 fedora1 systemd-userdbd[577]: Worker 82100 died with a failure exit status 127, ignoring.

After jumping ahead with a search in the output of the journalctl command to the current time around the most recent failed boot, error messages like these were seen in the logs:

Apr 23 16:27:05 fedora1 setroubleshoot[199695]: SELinux is preventing systemd-hostnam from execute access on the file /opt/openssl341/lib64/libcrypto.so.3.

These messages clearly identified that the libcrypto.so.3 module was at fault and that the specific location of that module was the NEW OpenSSL installation added the prior day. Correcting the problem required pointing the system away from the new OpenSSL installation. The existing OS installation binaries and libraries had not been altered, only bypassed via the system $PATH and the dynamic linker configuration. Rolling back should be straightforward.

Right? Maybe. Maybe not.
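
The log triage described in this section can be sketched as a small script that tallies the two message types seen in the journal. The regular expressions and function name are assumptions written for this post's excerpts only. The sample text reuses the messages quoted in this section, with the SELinux message on a single line and the truncated "sha>" endings left exactly as the journal displayed them.

```python
import re
from collections import Counter

# Patterns written against the excerpts quoted in this post (an assumption;
# other hosts may word these messages differently).
LOAD_ERROR = re.compile(r"error while loading shared libraries: (\S+):")
SELINUX_DENIAL = re.compile(
    r"SELinux is preventing (\S+) from\s+(\w+) access on the file (\S+)"
)

SAMPLE_JOURNAL = """\
Apr 23 15:00:20 fedora1 systemd-userdbd[82099]: /usr/lib/systemd/systemd-userwork: error while loading shared libraries: libcrypto.so.3: failed to map segment from sha>
Apr 23 15:00:20 fedora1 systemd-userdbd[577]: Worker 82099 died with a failure exit status 127, ignoring.
Apr 23 15:00:20 fedora1 systemd-userdbd[82100]: /usr/lib/systemd/systemd-userwork: error while loading shared libraries: libcrypto.so.3: failed to map segment from sha>
Apr 23 15:00:20 fedora1 systemd-userdbd[577]: Worker 82100 died with a failure exit status 127, ignoring.
Apr 23 16:27:05 fedora1 setroubleshoot[199695]: SELinux is preventing systemd-hostnam from execute access on the file /opt/openssl341/lib64/libcrypto.so.3.
"""


def summarize(journal_text):
    """Count library load failures and collect distinct SELinux file denials."""
    load_failures = Counter()
    denials = set()
    for line in journal_text.splitlines():
        load = LOAD_ERROR.search(line)
        if load:
            load_failures[load.group(1)] += 1
        denial = SELINUX_DENIAL.search(line)
        if denial:
            # The message ends with a period that is not part of the path.
            path = denial.group(3).rstrip(".")
            denials.add((denial.group(1), denial.group(2), path))
    return load_failures, sorted(denials)


failures, denials = summarize(SAMPLE_JOURNAL)
for library, count in failures.most_common():
    print(f"{count} load failure(s) for {library}")
for process, action, path in denials:
    print(f"SELinux denied {action} on {path} for {process}")
```

On a live system the input would be journalctl output instead of the sample string. With HUNDREDS of repeated messages, a tally like this makes the common factor (one library, one path) stand out quickly.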

Disabling the Faulty OpenSSL Installation

Since the server host altered the file system configuration to mount the main volume read-only, the /etc/bashrc controlling the system's default $PATH and the /etc/ld.so.conf configuration controlling the dynamic linker could be SEEN but they could not be EDITED. In order to alter the files and hide the presence of the /opt/openssl341 directory, the boot command specified on the GRUB menu at boot had to be altered to explicitly force the volume to boot in rw mode rather than ro mode.

In this case, the Fedora machine was a virtual machine guest running under Proxmox. "Console" access wasn't provided by a keyboard and monitor connected directly to a physical machine but instead by the "Console" function within the Proxmox administrative GUI at http://192.168.99.2:8006/. That allowed access to the GRUB menu displayed during boot so the boot command could be edited. The actual boot command looked like this:

root=UUID=7d825ab0-3b7b-44de-8c2e-8f0c97a5cefb ro rootflags=subvol=root rhgb quiet rd.driver.blacklist=nouveau modprobe.blacklist=nouveau

and was changed to this:

root=UUID=7d825ab0-3b7b-44de-8c2e-8f0c97a5cefb rw rootflags=subvol=root rhgb quiet rd.driver.blacklist=nouveau modprobe.blacklist=nouveau

With that alteration, the system booted with read-write access to the /etc directory allowing the broken OpenSSL installation path references to be removed from $PATH and the dynamic linker. It actually proved to be the change to the dynamic linker that corrected the fault and allowed the system to boot cleanly in read-write mode.
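
The edit itself was made by hand in the GRUB editor. Purely to make the change explicit, the sketch below performs the same token swap on a kernel command line string. The function name is hypothetical and the code does not touch any boot configuration.

```python
def force_root_mode(cmdline, mode):
    """Return cmdline with its ro/rw token set to mode ("ro" or "rw").

    Only the standalone ro/rw tokens are changed; every other argument is
    kept in its original order. If neither token is present, mode is appended.
    """
    if mode not in ("ro", "rw"):
        raise ValueError("mode must be 'ro' or 'rw'")
    tokens = [mode if tok in ("ro", "rw") else tok for tok in cmdline.split()]
    if mode not in tokens:
        tokens.append(mode)
    return " ".join(tokens)


# Boot arguments as quoted in this post.
ORIGINAL = (
    "root=UUID=7d825ab0-3b7b-44de-8c2e-8f0c97a5cefb ro rootflags=subvol=root "
    "rhgb quiet rd.driver.blacklist=nouveau modprobe.blacklist=nouveau"
)

print(force_root_mode(ORIGINAL, "rw"))
```

Splitting on whitespace and comparing whole tokens matters here: a plain text replace of "ro" would also corrupt arguments such as rootflags=subvol=root.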

NOTE. The rw edit made in the GRUB menu applied only to that one boot. Once the system rebooted successfully, the boot option returned to ro (read-only). How is the system working if the boot configuration is telling the system to start in read-only mode? By default, the boot process WILL mount the system disk read-only so processes running as part of the initrd (initialization RAM disk) can examine the volume to see if it was unmounted cleanly at shutdown or needs a file system check. That initrd logic will alter the access mode to read-write if no issues are found. Forcing the access mode to read-write will cause initrd to leave it read-write, even if lower level failures are found during the file system check. This allows read-write access when the full OS boots. Extreme caution is required any time a volume is forced to read-write mode, however.
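
Because the ro on the kernel command line does not say how the root volume ended up once the OS is running, the mode actually in effect is better read from the mount table. The sketch below parses text in the /proc/mounts format. The function names, the device name and the sample lines are assumptions for illustration, not data from the host in this post.

```python
# Hypothetical /proc/mounts content: device, mount point, type, options, dump, pass.
SAMPLE_MOUNTS = (
    "/dev/sda3 / btrfs ro,seclabel,relatime,subvol=/root 0 0\n"
    "proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0\n"
)


def mount_options(mounts_text, mount_point="/"):
    """Return the option list of the LAST entry for mount_point, or None.

    Later entries sit on top of earlier ones at the same mount point,
    so the last match describes what is visible now.
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")
    return options


def is_read_only(mounts_text, mount_point="/"):
    """True when mount_point is currently mounted with the ro option."""
    options = mount_options(mounts_text, mount_point)
    if options is None:
        raise LookupError(f"{mount_point} not found in mount table")
    return "ro" in options


print("root is read-only:", is_read_only(SAMPLE_MOUNTS))
```

On a live system the text would be the contents of /proc/mounts. A root volume still showing ro after boot completes is the symptom described in this post, not the normal early-boot state.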

Key Lessons Worth Re-Learning

When systems operate well for extended periods of time, it can be very easy to forget old best-practices and even easier to forget key diagnostic techniques required to correct issues. The following lessons are worth highlighting from this particular fire drill.

Manage the OS Installation of OpenSSL Separately From "User" Installations -- The OS installation of OpenSSL on Linux operating systems is crucial to MANY aspects of system startup and ongoing security. Changes to the versions of its binaries or shared libraries can trigger complex failures at reboot. If a different version of OpenSSL is needed for "user" purposes, build it into a user directory and alter user-specific environment settings to use that installation for end-user functions.
Always Include a Reboot When Altering the OS OpenSSL Installation -- Anything that breaks the OS installation of OpenSSL can trigger these failures at reboot, rendering a machine potentially unreachable except by console and vastly more difficult to fix. If a reboot is performed IMMEDIATELY after updating the OS OpenSSL, the likely cause will be immediately obvious. If a reboot is NOT performed until days / weeks later and the system becomes unreachable or unusable, the failure will generate MUCH more confusion and take longer to troubleshoot and resolve.
Treat Python Installations the Same as OpenSSL Installations -- Many Linux distributions have adopted Python for use in many of their package administration utilities and desktop related functions. Making changes to the "OS" instance of Python for use with user-level projects is NOT wise. Install "user" builds of Python as an end-user and update user-level $PATH settings to use the user instance instead of the system instance.
Remember Partial Absence of Subdirectories Can be Security Related Rather than Hardware Related -- If a Linux machine boots in read-only mode, missing content tied to specific userids is NOT likely a hardware fault and is likely recoverable. DO NOT give up on the "lost" data and DO NOT confuse the possible recovery by immediately attempting to find replacement copies from other media. Try to cure the logical problem first and exhaust all possibilities.
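
As one way to apply the first lesson, a private OpenSSL build can be exposed to a single build or run by setting environment variables for that process only, leaving /etc/ld.so.conf and the system $PATH untouched. The sketch below is a minimal illustration. The install prefix /home/builder/openssl341, the lib64 and bin subdirectory layout and the function name are all hypothetical.

```python
import os


def scoped_env(prefix, base_env=None):
    """Return a copy of the environment with a private install prefix
    placed first on LD_LIBRARY_PATH and PATH. Nothing global is modified."""
    env = dict(os.environ if base_env is None else base_env)

    def prepend(new_dir, existing):
        # Put the private directory first, keeping any existing entries.
        return new_dir if not existing else new_dir + os.pathsep + existing

    env["LD_LIBRARY_PATH"] = prepend(
        os.path.join(prefix, "lib64"), env.get("LD_LIBRARY_PATH")
    )
    env["PATH"] = prepend(os.path.join(prefix, "bin"), env.get("PATH"))
    return env


# Hypothetical user-owned prefix; only processes given this env would see it.
env = scoped_env("/home/builder/openssl341", {"PATH": "/usr/bin"})
print(env["LD_LIBRARY_PATH"])
print(env["PATH"])
```

The returned dictionary could be handed to subprocess.run(..., env=env) for a single build step. System services never see these variables, so they keep loading the OS copy of libcrypto.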

mdhLabs

Thursday, April 24, 2025

Recovering From a Read-Only Linux Boot
