Thursday, April 24, 2025

Recovering From a Read-Only Linux Boot

Nothing strikes fear into the heart of a system administrator or end user like logging into a system and finding their content GONE. As storage has shifted from hard drives to flash memory and file systems have become more reliable, loss of data from random failures has become far less likely. However, there are a variety of administrative mistakes -- beyond accidental erase / remove commands -- that can APPEAR to result in missing data. Given the overall reliability of the underlying hardware, it is useful to understand scenarios that can produce symptoms of data loss and to learn troubleshooting techniques that can identify the root cause and allow it to be resolved.

This post illustrates an administrative mistake involving an installation of OpenSSL that broke key processes within a Linux system and resulted in all content of the system's /home directory DISAPPEARING, along with other directories tied to remote NFS and Samba volumes. The problem took about 90 minutes to diagnose and ended with zero actual data loss.


The Original Administrative Mistake

This problem started after a new installation of OpenSSL was added to a Linux host running Fedora 43 so the OpenSSL libraries could be referenced when compiling / building a version of Python from source. To make the library directory of the OpenSSL build easy to reference in the Python build, OpenSSL was built from source and installed at /opt/openssl341. To ensure the OpenSSL shared modules were usable in the Python build, the /opt/openssl341/lib64 directory was added to the dynamic linker configuration file at /etc/ld.so.conf and the linker configuration reloaded by running /sbin/ldconfig to pick up the new library directory.

This allowed Python to be compiled but the Fedora host was not restarted after updating the dynamic linker. This deferred recognition of the fact that the new OpenSSL library modules were incompatible with various operating system components that used its libcrypto.so module.
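A quick check right after running ldconfig can show whether a library the OS depends on is now also being offered from a non-system directory. The following is a minimal sketch, not part of the original troubleshooting: it parses text in the format printed by `ldconfig -p`, and the sample entries, function names and the choice of "/lib" and "/usr/lib" as the stock prefixes are all assumptions made for illustration.

```python
import re

# Sample text in the format printed by `ldconfig -p`. These entries are
# illustrative only; they were not captured from the affected host.
SAMPLE_CACHE = """\
\tlibcrypto.so.3 (libc6,x86-64) => /opt/openssl341/lib64/libcrypto.so.3
\tlibcrypto.so.3 (libc6,x86-64) => /lib64/libcrypto.so.3
\tlibz.so.1 (libc6,x86-64) => /lib64/libz.so.1
"""

# One cache entry: "<soname> (<flags>) => <path>"
ENTRY = re.compile(r"^\s*(\S+)\s+\(.*?\)\s+=>\s+(\S+)$")


def library_paths(cache_text, soname):
    """Return every cached path for soname, in the order they are listed."""
    paths = []
    for line in cache_text.splitlines():
        match = ENTRY.match(line)
        if match and match.group(1) == soname:
            paths.append(match.group(2))
    return paths


def outside_system_dirs(paths, system_prefixes=("/lib", "/usr/lib")):
    """Keep only paths that do not live under the assumed stock directories."""
    return [p for p in paths if not p.startswith(system_prefixes)]


paths = library_paths(SAMPLE_CACHE, "libcrypto.so.3")
print(paths)
print(outside_system_dirs(paths))
```

On a live system the text could come from `subprocess.run(["ldconfig", "-p"], capture_output=True, text=True).stdout`. Any libcrypto entry outside the stock directories is a hint to reboot and test immediately rather than later.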


How The Mistake Broke the System Boot

Many different modules for handling user authorization, logging and file system management rely upon the libcrypto module to function. Under the covers, the SELinux (Security Enhanced Linux) layer had detected differences between the new libcrypto.so module seen via the dynamic linker and the rest of the OpenSSL installation the OS was using and BLOCKED all access to the libcrypto.so module. That caused numerous processes to fail during startup. The kernel saw these failures and altered the boot configuration to mount the system's primary drive read-only instead of read-write.

Since the unexpected version of libcrypto.so broke the User Database Service (systemd-userdbd.service), higher level operating system functions requiring that layer to be functioning to control access to other processes and resources failed and could not perform required functions. One of those functions involves displaying entries in the filesystem. As a result, attempting to list any directories within /home which are owned by non-root users didn't just return the directory information with weird integer ID values for the owner and group, such listings returned NOTHING. This gave the impression the content was GONE, rather than merely unreadable or unwriteable.

Of course, enforcement of user-based security is also crucial to creating links to external file systems using NFS and Samba. As a result, the directory /gitrepo linked to a remote NFS storage volume on a TrueNAS server was not connected, making it appear like all local git repository content had been lost. Another directory /smb used to backup non-source related content was also unable to connect making THAT content appear to have disappeared as well.

Luckily, the larger environment was configured with other Fedora and Windows systems with similar connections to those remote NFS and Samba volumes and all of those connections worked, proving the content was present and had not been lost. That made it easier to focus on finding a cause that could be corrected without loss of data.


Finding the Underlying Fault

The first problem that became apparent with the failed boot is that some of the first common sources of diagnostics such as /var/log/warn or /var/log/last that identify events in the most recent boot had no new content. They couldn't because the entire machine volume had been mounted read-only. Instead, the journalctl command provided similar details and quickly pointed to openssl being involved.

The first set of log messages that pointed out a problem involved HUNDREDS of these messages that were generated around the time the new version of OpenSSL was first installed and the dynamic linker configuration updated the day before.

Apr 23 15:00:20 fedora1 systemd-userdbd[82099]: /usr/lib/systemd/systemd-userwork: error while loading shared libraries: libcrypto.so.3: failed to map segment from sha>
Apr 23 15:00:20 fedora1 systemd-userdbd[577]: Worker 82099 died with a failure exit status 127, ignoring.
Apr 23 15:00:20 fedora1 systemd-userdbd[82100]: /usr/lib/systemd/systemd-userwork: error while loading shared libraries: libcrypto.so.3: failed to map segment from sha>
Apr 23 15:00:20 fedora1 systemd-userdbd[577]: Worker 82100 died with a failure exit status 127, ignoring.
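With hundreds of near-identical lines, a tally by library name makes the pattern obvious at a glance. This is a minimal sketch that counts "error while loading shared libraries" messages in saved journal text. The function name failing_libraries is made up for illustration and the sample reuses lines from the excerpt above.

```python
import re
from collections import Counter

# Lines copied from the journal excerpt above, truncated where journalctl
# truncated them.
SAMPLE_JOURNAL = """\
Apr 23 15:00:20 fedora1 systemd-userdbd[82099]: /usr/lib/systemd/systemd-userwork: error while loading shared libraries: libcrypto.so.3: failed to map segment from sha>
Apr 23 15:00:20 fedora1 systemd-userdbd[577]: Worker 82099 died with a failure exit status 127, ignoring.
Apr 23 15:00:20 fedora1 systemd-userdbd[82100]: /usr/lib/systemd/systemd-userwork: error while loading shared libraries: libcrypto.so.3: failed to map segment from sha>
"""

# The dynamic loader names the library right after this fixed phrase.
LOAD_ERROR = re.compile(r"error while loading shared libraries: ([^:\s]+):")


def failing_libraries(journal_text):
    """Count loader failures per shared library name."""
    counts = Counter()
    for line in journal_text.splitlines():
        match = LOAD_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


print(failing_libraries(SAMPLE_JOURNAL).most_common())
```

The same function could be fed the full output of `journalctl --no-pager` saved to a file.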

After jumping ahead with a search in the output of the journalctl command to the current time around the most recent failed boot, error messages like these were seen in the logs:

Apr 23 16:27:05 fedora1 setroubleshoot[199695]: SELinux is preventing systemd-hostnam from 
execute access on the file /opt/openssl341/lib64/libcrypto.so.3.

These messages clearly identified that the libcrypto.so.3 module was at fault and that the specific location of that module was the NEW OpenSSL installation just added the prior day. Correcting the problem required pointing the system away from the new OpenSSL installation. The existing OS installation binaries and libraries had not been altered, only bypassed via the system $PATH and the dynamic linker configuration. Rolling back should be straightforward.

Right? Maybe. Maybe not.


Disabling the Faulty OpenSSL Installation

Since the server host altered the file system configuration to mount the main volume read-only, the /etc/bashrc controlling the system's default $PATH and the /etc/ld.so.conf configuration controlling the dynamic linker could be SEEN but they could not be EDITED. In order to alter the files and hide the presence of the /opt/openssl341 directory, the boot command specified on the GRUB menu at boot had to be altered to explicitly force the volume to boot in rw mode rather than ro mode.

In this case, the Fedora machine was a virtual machine guest running under Proxmox. "Console" access wasn't provided by direct connection with a keyboard and monitor to a physical machine but instead by the "Console" function within the Proxmox administrative GUI at http://192.168.99.2:8006/. That allowed access to the GRUB menu displayed during boot so the boot command could be edited. The actual boot command looked like this:

root=UUID=7d825ab0-3b7b-44de-8c2e-8f0c97a5cefb ro rootflags=subvol=root rhgb quiet rd.driver.blacklist=nouveau modprobe.blacklist=nouveau

and was changed to this:

root=UUID=7d825ab0-3b7b-44de-8c2e-8f0c97a5cefb rw rootflags=subvol=root rhgb quiet rd.driver.blacklist=nouveau modprobe.blacklist=nouveau

With that alteration, the system booted with read-write access to the /etc directory allowing the broken OpenSSL installation path references to be removed from $PATH and the dynamic linker. It actually proved to be the change to the dynamic linker that corrected the fault and allowed the system to boot cleanly in read-write mode.
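The GRUB edit amounts to swapping one whitespace-delimited token on the kernel command line. The sketch below only illustrates that swap on a string; it does not touch any GRUB configuration, and the function name force_read_write is an assumption for illustration. Working on whole tokens matters because a plain text replace of "ro" would also mangle "root=" and "rootflags=".

```python
def force_read_write(cmdline):
    """Return the command line with a standalone 'ro' token replaced by 'rw'."""
    tokens = cmdline.split()
    # Compare whole tokens so 'root=...' and 'rootflags=...' are left alone.
    return " ".join("rw" if token == "ro" else token for token in tokens)


original = (
    "root=UUID=7d825ab0-3b7b-44de-8c2e-8f0c97a5cefb ro rootflags=subvol=root "
    "rhgb quiet rd.driver.blacklist=nouveau modprobe.blacklist=nouveau"
)
print(force_read_write(original))
```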

NOTE: After the boot configuration was altered in GRUB to specify rw instead of ro and the system rebooted successfully, the boot option returned to ro (read-only). How is the system working if the boot configuration is telling the system to start in read-only mode? By default, the boot process WILL mount the system disk read-only so processes running as part of the initrd (initialization RAM disk) can examine the volume to see whether it was dismounted cleanly at shutdown or needs a file system check. That initrd logic will alter the access mode to read-write if no issues are found. Forcing the access mode to read-write will cause initrd to leave it read-write, even if lower level failures are found during the file system check. This allows read-write access when the full OS boots. Extreme caution is required any time a volume is forced to read-write mode, however.
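Since the ro on the boot command line does not describe the final state, the mode the running system actually ended up with has to be read from the mount table. A minimal sketch, assuming text in the /proc/mounts layout (device, mount point, type, comma-separated options); the sample lines and the function name mount_mode are made up for illustration.

```python
def mount_mode(mounts_text, mount_point="/"):
    """Return 'ro' or 'rw' for mount_point, or None if it is not listed."""
    mode = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")
            # Later entries for the same mount point override earlier ones.
            if "ro" in options:
                mode = "ro"
            elif "rw" in options:
                mode = "rw"
            else:
                mode = None
    return mode


# Illustrative sample in /proc/mounts format, not taken from the affected host.
SAMPLE_MOUNTS = (
    "/dev/sda3 / btrfs rw,seclabel,relatime,subvol=/root 0 0\n"
    "/dev/sda1 /boot ext4 ro,relatime 0 0\n"
)
print(mount_mode(SAMPLE_MOUNTS))
print(mount_mode(SAMPLE_MOUNTS, "/boot"))
```

On a Linux host, `mount_mode(open("/proc/mounts").read())` would report the current state of the root volume.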


Key Lessons Worth Re-Learning

When systems operate well for extended periods of time, it can be very easy to forget old best-practices and even easier to forget key diagnostic techniques required to correct issues. The following lessons are worth highlighting from this particular fire drill.

  1. Manage the OS Installation of OpenSSL Separately From "User" Installations -- The OS installation of OpenSSL on Linux operating systems is crucial to MANY aspects of system startup and ongoing security. Changes to the versions of its binaries or shared libraries can trigger complex failures at reboot. If a different version of OpenSSL is needed for "user" purposes, build it into a user directory and alter user-specific environment settings to use that installation for end-user functions.
  2. Always Include a Reboot When Altering the OS OpenSSL Installation -- Anything that breaks the OS installation of OpenSSL can trigger these failures at reboot, rendering a machine potentially unreachable except by console, making it vastly more difficult to fix. If a reboot is performed IMMEDIATELY after updating the OS OpenSSL, the likely cause will be immediately obvious. If a reboot is NOT performed until days / weeks later and the system becomes unreachable or unusable, the failure will generate MUCH more confusion and take longer to troubleshoot and resolve.
  3. Treat Python Installations the Same as OpenSSL Installations -- Many Linux distributions have adopted Python for use in many of their package administration utilities and desktop related functions. Making changes to the "OS" instance of Python for use with user-level projects is NOT wise. Install "user" builds of Python as an end-user and update user-level $PATH settings to use the user instance instead of the system instance.
  4. Remember Partial Absence of Subdirectories Can be Security Related Rather than Hardware Related -- If a Linux machine boots in read-only mode, missing content tied to specific userids is NOT likely a hardware fault and is likely recoverable. DO NOT give up on the "lost" data and DO NOT confuse the possible recovery by immediately attempting to find replacement copies from other media. Try to cure the logical problem first and exhaust all possibilities.

Tuesday, February 4, 2025

Digital Oscilloscope Screen Captures

Any troubleshooting or design work involving audio equipment or digital logic is often sped up by using an oscilloscope to look at analog wave forms or analyze digital signals and alignment across a circuit. When documenting such troubleshooting and design work, being able to capture a signal trace on an oscilloscope is helpful in communicating a diagnosis or design. Most modern digital scopes provide a USB host port that allows a USB thumb drive to be plugged in and used as the destination in writing screen dumps to paste into other documents.

Since at least 2015, many digital scope makers have expanded beyond this simple approach to capturing screen images by implementing networking and VISA (Virtual Instrument Software Architecture) protocols developed by National Instruments that add significant automation and scripting capabilities to a variety of laboratory gear. APIs implementing these VISA standards have been implemented in a variety of languages including Python, C#, Java and .Net.

(You can see where this is going...)

This capability exists on virtually all scopes. The process of using it will be demonstrated using a Rigol DHO924S scope that operates atop the Android operating system and Python as a scripting language. However, these newer scopes also make screen captures possible without any scripting using functions built within browser interfaces. Both approaches will be shown. The browser approach is easier for rare, occasional use but the ability to script a capture allows it to be used within a larger process that might also be scripted.


Connecting to the Scope - USB or IP

Rigol scopes accept connections via a USB Device port tied to a laptop / desktop computer or via IP. Use of the USB interface might be preferable for selected tasks and seems logically preferable given that Rigol scopes do not (yet) have WiFi IP connectivity and a wired Ethernet connection may not always be close to where the scope is being used. However, communicating to a Rigol scope over USB requires installation of an application developed by Rigol called UltraSigma whose user interface components were last altered around 2016 but visually appear to be coded using mid-1990s frameworks. Given the age of the software, it requires installation by a user with Admin privileges on Windows.


Finding the Scope's VISA Address

VISA libraries use identifiers in a specific format to identify a specific lab device. When accessing a Rigol scope via IP or USB, that identifier will take one of these forms:

TCPIP::192.168.99.29::INSTR
USB0::0x1AB1::0x044C::DHO9S254201528::INSTR
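Both forms are fields separated by "::", so they are easy to take apart in a script. The following is a minimal sketch covering only the two forms shown above. The function name and dictionary keys are assumptions for illustration, and it assumes the USB vendor and product fields are written in hex as in the example.

```python
def parse_visa_address(address):
    """Split a TCPIP or USB VISA address into named fields."""
    parts = address.split("::")
    # Drop the optional board number: 'USB0' -> 'USB', 'TCPIP' stays 'TCPIP'.
    interface = parts[0].rstrip("0123456789")
    if interface == "TCPIP" and len(parts) >= 3:
        return {
            "interface": "TCPIP",
            "host": parts[1],
            "resource_class": parts[-1],
        }
    if interface == "USB" and len(parts) >= 5:
        return {
            "interface": "USB",
            "vendor_id": int(parts[1], 16),   # assumes hex, as in the example
            "product_id": int(parts[2], 16),  # assumes hex, as in the example
            "serial": parts[3],
            "resource_class": parts[-1],
        }
    raise ValueError(f"unrecognized VISA address: {address!r}")


print(parse_visa_address("TCPIP::192.168.99.29::INSTR"))
print(parse_visa_address("USB0::0x1AB1::0x044C::DHO9S254201528::INSTR"))
```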

If IP connectivity is used, the VISA address will be visible in the Rigol scope's Utility sub-function as soon as the scope boots and pulls an IP address from the DHCP server. The screen will look like this:

Note that it IS possible to statically assign the IP so the scope obtains the same IP address consistently, avoiding the need to possibly change this IP reference in the VISA address in the script. However, in most small networks, DHCP servers in gateway routers will typically re-assign the same IP to the same MAC unless they exhaust their available pool so changing IP addresses isn't often an issue.

If USB connectivity is used, the view in the scope will NOT be updated since a USB connection is not considered a "network" connection. Instead, the USB format VISA address can be identified in two ways. One way is to temporarily connect the scope to an IP network, surf to the scope's IP and glean the USB address from the top level Rigol Web Control view (see a sample screen dump further below). The other way is to install the UltraSigma software package from Rigol and run its parent utility program to display the VISA string.

The view displaying the USB VISA address looks like this in that parent application:

Since the USB designation may change based on which physical USB port on the computer is used and which USB controller is driving that physical port, this information must be discovered from the PC end. There's no way to predict it by looking at information in the scope's displays.
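Discovery from the PC end can also be scripted, since pyvisa's ResourceManager offers list_resources(). This is a rough sketch under a few assumptions: the vendor ID 0x1AB1 is taken from the example address above, the helper names are made up, and whether USB instruments show up at all depends on the VISA backend and USB drivers installed. Some backends report the IDs in decimal rather than hex, so the filter accepts both.

```python
RIGOL_VENDOR_ID = 0x1AB1  # taken from the example USB address above


def _field_as_int(text):
    """Parse '0x1AB1' or '6833' style fields; return None if neither fits."""
    try:
        return int(text, 0)
    except ValueError:
        return None


def usb_instruments(resources, vendor_id=RIGOL_VENDOR_ID):
    """Keep the USB INSTR resource names whose vendor field matches vendor_id."""
    matches = []
    for name in resources:
        parts = name.split("::")
        if (len(parts) >= 5
                and parts[0].startswith("USB")
                and parts[-1] == "INSTR"
                and _field_as_int(parts[1]) == vendor_id):
            matches.append(name)
    return matches


def discover():
    """Ask the local VISA backend for attached instruments (needs pyvisa)."""
    import pyvisa  # imported here so the filter above works without pyvisa
    rm = pyvisa.ResourceManager()
    return usb_instruments(rm.list_resources())


# Illustrative input; call discover() on a PC with the scope attached.
print(usb_instruments((
    "TCPIP::192.168.99.29::INSTR",
    "USB0::0x1AB1::0x044C::DHO9S254201528::INSTR",
)))
```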

Using Python and pyvisa for Captures

With the scope connected via IP or USB and its VISA address identified, the logic required to make VISA calls to address the scope and trigger a screen capture is very simple in Python. First, two libraries are required which can be installed via these commands.


pip install -U pyvisa
pip install -U pyvisa-py

As of February 4, 2025, these will install version 1.14.1 of pyvisa and version 0.7.2 of pyvisa-py.

With those libraries installed, a simple script like the following will allow an output filename to be specified along with an option to include a datetime stamp like 20250204193059 in the filename.

import pyvisa
import argparse
import datetime

# use argparse to parse arguments for filename and optional datestamp
parser = argparse.ArgumentParser(
    prog="capturescope",
    description='Captures screen dumps via VISA protocol from Rigol oscilloscope',
    epilog='syntax: capturescope.py filename --timestamp'
)
parser.add_argument('filename', help='name of file without extension to write')
parser.add_argument('-t', '--timestamp',
                    help='adds yyyymmddhhmmss timestamp to filename',
                    action='store_true')
args = parser.parse_args()

fullfilename = args.filename
if args.timestamp:
    # need to get the current yyyymmddhhmmss timestamp
    now = datetime.datetime.now()
    yyyymmddhhmmss = now.strftime("%Y%m%d%H%M%S")
    fullfilename = fullfilename + '.' + yyyymmddhhmmss
fullfilename = fullfilename + '.png'
print("Writing screen capture to: ", fullfilename)

# Connect to the oscilloscope
rm = pyvisa.ResourceManager()

# here is my scope's reference when connected via TCPIP
scope = rm.open_resource('TCPIP::192.168.99.29::INSTR')

# here is my scope's reference when connected via USB to my laptop
# find this via the UltraSigma app from Rigol or temporarily connect via IP
# then surf to http://ipaddress
# scope = rm.open_resource('USB0::0x1AB1::0x044C::DHO9S254201528::INSTR')

# Set the timeout (milliseconds)
scope.timeout = 5000

# Get the screenshot as a list of unsigned bytes
screenshot = scope.query_binary_values(':DISP:DATA?', datatype='B')

# Save the screenshot as a PNG file
with open(fullfilename, 'wb') as f:
    f.write(bytes(screenshot))

# Close the connection
scope.close()
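The script writes whatever the scope returns into a file named .png. If there is any doubt about what a given model or firmware sends back, the leading bytes of the payload can be checked before choosing the extension. A minimal sketch using the standard PNG, BMP and JPEG signatures; the function name image_extension is an assumption for illustration.

```python
def image_extension(data):
    """Guess a file extension from the leading signature bytes of image data."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return ".png"
    if data.startswith(b"BM"):
        return ".bmp"
    if data.startswith(b"\xff\xd8\xff"):
        return ".jpg"
    return None  # unknown format; worth inspecting before saving


# Illustrative payloads: just the signature bytes followed by filler.
print(image_extension(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
print(image_extension(b"BM" + b"\x00" * 8))
```

In the capture script this could be called as `image_extension(bytes(screenshot))` just before the output filename is assembled.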

With that script, anything present on the screen can be captured using commands like this:

c:\Docs\gitwork\labutils>python capturescope.py scopeaddress --timestamp
Writing screen capture to:  scopeaddress.20250203215700.png

c:\Docs\gitwork\labutils>python capturescope.py negativeclock --timestamp
Writing screen capture to:  negativeclock.20250203221010.png

c:\Docs\gitwork\labutils>python capturescope.py positiveclock --timestamp
Writing screen capture to:  positiveclock.20250203221056.png

c:\Docs\gitwork\labutils>dir
 Volume in drive C is OS
 Volume Serial Number is 5841-F07E

 Directory of c:\Docs\gitwork\labutils

02/03/2025  10:15 PM    <DIR>          .
02/03/2025  10:15 PM    <DIR>          ..
02/03/2025  09:55 PM             1,327 capturescope.py
02/03/2025  10:10 PM            74,996 negativeclock.20250203221010.png
02/03/2025  10:10 PM            76,419 positiveclock.20250203221056.png
02/03/2025  09:57 PM           103,686 scopeaddress.20250203215700.png
               4 File(s)        256,428 bytes
               2 Dir(s)  500,848,386,048 bytes free

c:\Docs\gitwork\labutils>

Captures via Rigol Web Control

When connected to an IP network, most (all?) Rigol scopes expose a web server on the scope's IP address, without SSL encryption or login protection, that allows any function that can be performed using the touch screen on the scope to be performed with a click in a browser window. For a scope assigned 192.168.99.29 as its IP, surfing to http://192.168.99.29 will display this screen.

Clicking on the Web Control button will pop open a new browser window with a full window matching the scope's live touchscreen like this. NOTE: It is worth mentioning that this browser view is IDENTICAL in functionality to the touch screen on the scope itself. Any action that can be performed by touching the screen on the scope can be performed by clicking on the same spot on this browser view. HANDY.

It is certainly possible to capture the scope screen using the PC's "screen scraping" utilities (like Windows-Shift-S in Windows) to capture the image from this browser view.

It is also possible to use the Print Screen button which displays a different browser page allowing a choice between a static snapshot and a recording.

If the Take Screenshot button is clicked, a screenshot will be captured and rendered in that browser window. At that point, you can right-click on it, choose Save Image As... then write the file wherever desired on the local PC. If the goal is to capture a live change in a signal, the Record Screen button allows control over the start and stop then prompts for the filename and destination to save the *.mp4 video file to the browser PC.