APPENDIX A
Troubleshooting Sun StorageTek QFS
This appendix describes some tools and procedures that can be used to troubleshoot issues with the Sun StorageTek QFS file system. Specifically, it contains the following topics:
Sun StorageTek QFS file systems write validation data in the following records that are critical to file system operations: directories, indirect blocks, and inodes. If the file system detects corruption while searching a directory, it issues an EDOM error, and the directory is not processed. If an indirect block is not valid, it issues an ENOCSI error, and the file is not processed. TABLE A-1 summarizes these error indicators.
In addition, inodes are validated and cross-checked with directories.
You should monitor the following files for error conditions:
If a discrepancy is noted, you should unmount the file system and check it using the samfsck(1M) command.
Note - The samfsck(1M) command can be issued on a mounted file system, but the results cannot be trusted. Because of this, you are encouraged to run the command on an unmounted file system only.
Use the samfsck(1M) command to perform a file system check.
Use this command in the following format:
For family-set-name, specify the name of the file system as specified in the mcf file.
You can send output from samfsck(1M) both to your screen and to a file by using it in conjunction with the tee(1) command, as follows.
Nonfatal errors returned by samfsck(1M) are preceded by NOTICE. Nonfatal errors are lost blocks and orphans. The file system is still consistent if NOTICE errors are returned. You can repair these nonfatal errors during a convenient, scheduled maintenance outage.
Fatal errors are preceded by ALERT. These errors include duplicate blocks, invalid directories, and invalid indirect blocks. The file system is not consistent if these errors occur. Notify Sun if the ALERT errors cannot be explained by a hardware malfunction.
If the samfsck(1M) command detects file system corruption and returns ALERT messages, you should determine the reason for the corruption. If hardware is faulty, repair it before repairing the file system.
For more information about the samfsck(1M) and tee(1) commands, see the samfsck(1M) and tee(1) man pages.
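The NOTICE and ALERT distinction described above can be checked mechanically once the samfsck(1M) output has been saved. The following is a minimal sketch, not part of the product: the function name summarize_samfsck is hypothetical, and it assumes only that nonfatal message lines contain the word NOTICE and fatal message lines contain the word ALERT. On Solaris hosts, nawk(1) might be needed in place of awk.

```shell
#!/bin/sh
# Summarize samfsck(1M) output read from standard input.
# Assumption: nonfatal message lines contain NOTICE and fatal message
# lines contain ALERT, as described in this appendix.
summarize_samfsck() {
    awk '
        /ALERT/  { alerts++ }
        /NOTICE/ { notices++ }
        END {
            # Uninitialized counters print as 0.
            printf "notices=%d alerts=%d\n", notices, alerts
            if (alerts > 0)       print "status=inconsistent"
            else if (notices > 0) print "status=consistent-with-notices"
            else                  print "status=clean"
        }
    '
}
```

One possible use is to save the output with tee(1) to a file of your choosing and then run the function on that file, for example summarize_samfsck < /tmp/samfsck.out, where /tmp/samfsck.out is only an illustrative file name.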
1. Use the umount(1M) command to unmount the file system.
Run the samfsck(1M) command when the file system is not mounted. For information about unmounting a file system, see Unmounting a File System.
2. Use the samfsck(1M) command to repair a file system. If you are repairing a shared file system, issue the command from the metadata server.
You can issue the samfsck(1M) command in the following format to repair a file system:
For fsname, specify the name of the file system as specified in the mcf file.
The following sections describe what to do when a sammkfs(1M) or mount(1M) command fails or when a mount(1M) command hangs in a shared file system.
The procedures in this section can be performed on client hosts and can also be performed on the server. Commands that can be executed only on the metadata server are preceded with a server# prompt.
If the sammkfs(1M) command returns an error or messages indicating that an unexpected set of devices is to be initialized, you need to perform this procedure. It includes steps for verifying the mcf file and for propagating mcf file changes to the system.
1. Use the sam-fsd(1M) command to verify the mcf file.
Examine the output from the sam-fsd(1M) command and determine if there are errors that you need to fix.
2. If the output from the sam-fsd(1M) command indicates that there are errors in the /etc/opt/SUNWsamfs/mcf file, edit the mcf file to resolve these issues.
3. Issue the sam-fsd(1M) command again to verify the mcf file.
Repeat Step 1, Step 2, and Step 3 of this process until the output from the sam-fsd(1M) command indicates that the mcf file is correct.
4. Issue the samd(1M) config command.
This is needed to propagate mcf file changes by informing the sam-fsd daemon of the configuration change.
A mount(1M) command can fail for several reasons. This section describes some actions you can take to remedy a mount problem. If the mount(1M) command hangs, rather than fails, see Recovering From a Hung mount(1M) Command.
Some failed mount(1M) behaviors and their remedies are as follows:
If this procedure does not expose errors, perform To Use the samfsinfo(1M) and samsharefs(1M) Commands, which can help you verify that the file system has been created and that the shared hosts file is correctly initialized.
The following procedure shows you what to verify if the mount(1M) command fails.
1. Ensure that the mount point directory is present.
There are multiple ways to accomplish this. For example, you can issue the ls(1) command in the following format:
For mountpoint, specify the name of the Sun StorageTek QFS shared file system's mount point.
When you examine the ls(1) command's output, make sure that it shows a directory with access mode 755. In other words, the mode string should read drwxr-xr-x. CODE EXAMPLE A-1 shows example output.
# ls -ld /sharefs1
drwxr-xr-x   2 root     sys          512 Mar 19 10:46 /sharefs1
If the access is not at this level, enter the following chmod(1) command:
For mountpoint, specify the name of the Sun StorageTek QFS shared file system's mount point.
2. Ensure that there is an entry for the file system in the /etc/vfstab file.
CODE EXAMPLE A-2 shows an entry for the shared file system named sharefs1.
# File /etc/vfstab
# FS name   FS to fsck   Mnt pt      FS type   fsck pass   Mt@boot   Mt params
sharefs1    -            /sharefs1   samfs     -           yes       shared,bg
Ensure that the shared flag is present in the Mount Parameters field of the shared file system's entry in the /etc/vfstab file.
3. Ensure that the mount point directory is not shared out for NFS use.
If the mount point is shared, use the unshare(1M) command to unshare it. For example:
For mountpoint, specify the name of the Sun StorageTek QFS shared file system's mount point.
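The first two checks in this procedure lend themselves to a small script. The following is a sketch under stated assumptions rather than a supported tool: the function names mode_is_755 and vfstab_has_shared are hypothetical, and the field positions follow the /etc/vfstab layout shown in CODE EXAMPLE A-2. On Solaris hosts, nawk(1) might be needed in place of awk.

```shell
#!/bin/sh
# mode_is_755 LINE
# Succeed if LINE, one line of "ls -ld" output, starts with drwxr-xr-x.
mode_is_755() {
    case "$1" in
        drwxr-xr-x*) return 0 ;;
        *)           return 1 ;;
    esac
}

# vfstab_has_shared FSNAME [FILE]
# Succeed if FILE (default /etc/vfstab) has a non-comment samfs entry
# for FSNAME whose mount parameters (field 7) include "shared".
vfstab_has_shared() {
    awk -v fs="$1" '
        $1 ~ /^#/ { next }                      # skip comment lines
        $1 == fs && $4 == "samfs" {
            n = split($7, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "shared") found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "${2:-/etc/vfstab}"
}
```

For example, mode_is_755 "$(ls -ld /sharefs1)" tests the mount point from CODE EXAMPLE A-1, and vfstab_has_shared sharefs1 tests the entry from CODE EXAMPLE A-2.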
This procedure shows how to analyze the output from these commands.
1. Enter the samfsinfo(1M) command on the server.
Use this command in the following format:
For filesystem, specify the name of the Sun StorageTek QFS shared file system as specified in the mcf file. CODE EXAMPLE A-3 shows the samfsinfo(1M) command and output.
The output from CODE EXAMPLE A-3 shows a shared keyword in the following line:
Note the list of file system devices, ordinals, and equipment numbers that appear after the following line:
Make sure that these numbers correspond to the devices in the file system's mcf(4) entry.
2. Enter the samsharefs(1M) command on the server.
Use this command in the following format:
For filesystem, specify the name of the Sun StorageTek QFS shared file system as specified in the mcf file. CODE EXAMPLE A-4 shows the samsharefs(1M) command and output.
The following information pertains to the diagnostic output from the samfsinfo(1M) or samsharefs(1M) commands.
If the samfsinfo(1M) and samsharefs(1M) commands do not expose irregularities, perform To Use the samfsconfig(1M) Command.
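When comparing the devices reported by samfsinfo(1M) with the file system's mcf(4) entry, it can help to list only the mcf lines that belong to one family set. The following sketch assumes the usual mcf field order (equipment identifier, equipment ordinal, equipment type, family set); the function name mcf_devices is hypothetical. On Solaris hosts, nawk(1) might be needed in place of awk.

```shell
#!/bin/sh
# mcf_devices FSNAME [FILE]
# Print "ordinal type identifier" for each device line in FILE
# (default /etc/opt/SUNWsamfs/mcf) whose family set field is FSNAME.
# The family set definition line itself (identifier == FSNAME) is skipped.
mcf_devices() {
    awk -v fs="$1" '
        $1 ~ /^#/ { next }                      # skip comment lines
        NF >= 4 && $4 == fs && $1 != fs { print $2, $3, $1 }
    ' "${2:-/etc/opt/SUNWsamfs/mcf}"
}
```

Running mcf_devices sharefs1 would then give a short list to compare, line by line, against the ordinals and equipment numbers in the samfsinfo(1M) output.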
On clients with nodev device entries in the mcf file for the file system, the entire file system might not be accessible, and the shared hosts file might not be directly accessible. You can use the samfsconfig(1M) command to determine whether the shared file system's data partitions are accessible.
Issue the samfsconfig(1M) command.
Use this command in the following format:
For list-of-devices, specify the list of devices from the file system entry in the mcf file. Use a space to separate multiple devices in the list.
Example 1. CODE EXAMPLE A-5 shows the mcf file for the host tethys, a host that does not have a nodev entry in its mcf file. It then shows the samfsconfig(1M) command issued.
Example 2. CODE EXAMPLE A-6 shows the samfsconfig(1M) command being used on a host that has a nodev entry in its mcf file.
For examples 1 and 2, verify that the output lists all slices from the file system, other than the metadata (mm) devices, as belonging to the file system. This is the case for example 2.
If the mount(1M) command hangs, follow the procedure in this section. You have a hung mount(1M) command if, for example, the mount(1M) command fails with a connection error or with a Server not responding message that does not resolve itself within 30 seconds.
The most typical remedy for a hung mount(1M) command is presented first. If that does not work, perform the subsequent procedures.
The netstat(1M) command verifies that the sam-sharefsd daemon's network connections are correctly configured.
1. Become superuser on the metadata server.
2. Type the samu(1M) command to invoke the samu(1M) operator utility.
3. Press :P to access the Active Services display.
CODE EXAMPLE A-7 shows a P display.
Active Services                      samu 4.4 09:02:22 Sept 22 2005

Registered services for host `titan':
    sharedfs.sharefs1
  1 service registered.
Examine the output. In CODE EXAMPLE A-7, look for a line that contains sharedfs.filesystem-name. In this example, the line must contain sharedfs.sharefs1.
If no such line appears, you need to verify that both the sam-fsd and sam-sharefsd daemons have started. Perform the following steps:
a. Enable daemon tracing in the defaults.conf file.
For information about how to enable tracing, see defaults.conf(4) or see Step 2 in To Examine the sam-sharefsd Trace Log.
b. Examine your configuration files, especially /etc/opt/SUNWsamfs/mcf.
c. After you have checked your configuration files and verified that the daemons are active, begin this procedure again.
4. Enter the samsharefs(1M) command to check the hosts file.
CODE EXAMPLE A-8 shows the samsharefs(1M) command and correct output.
In the output on your system, verify the following:
5. Enter the netstat(1M) command on the server.
CODE EXAMPLE A-9 shows the netstat(1M) command entered on server titan.
Verify that the output from the netstat(1M) command on the server contains the following:
This example shows ESTABLISHED entries for tethys and dione. There should be one ESTABLISHED entry for each client that is configured and running, whether or not it is mounted.
6. Enter the netstat(1M) command on the client.
CODE EXAMPLE A-10 shows the netstat(1M) command entered on client dione.
7. Verify that the output contains the following:
If these lines are present, then the network connection is established.
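Counting ESTABLISHED entries by eye is error-prone when many clients are configured. The following sketch counts them from saved netstat(1M) output. The function name count_established is hypothetical, and the service name you pass in (sam-qfs is used below purely as an assumed example) should be whatever name or port your netstat(1M) output shows for the shared file system daemon. On Solaris hosts, nawk(1) might be needed in place of awk.

```shell
#!/bin/sh
# count_established SERVICE
# Read netstat-style output on standard input and print the number of
# lines that mention SERVICE and are in the ESTABLISHED state.
count_established() {
    awk -v svc="$1" '
        index($0, svc) && /ESTABLISHED/ { n++ }
        END { print n + 0 }                     # print 0 when nothing matched
    '
}
```

On the server, the printed count can be compared with the number of clients that are configured and running, since the procedure expects one ESTABLISHED entry for each of them.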
If an ESTABLISHED connection is not reported, perform one or more of the following procedures:
Perform these steps if using the procedure described in To Verify Network Connections did not show an ESTABLISHED connection.
1. Use the samsharefs(1M) command to verify the hosts file on the server.
You can issue the samsharefs(1M) command on alternate server hosts and client hosts that have no nodev devices listed in the host's mcf(4) entry for the file system. For this step, use this command in the following format:
For filesystem, specify the name of the Sun StorageTek QFS shared file system as specified in the mcf file. CODE EXAMPLE A-11 shows the samsharefs(1M) -R command.
2. Save the samsharefs(1M) output.
If the steps in this procedure fail, you need this output for use in subsequent procedures.
3. Verify that the output matches expectations.
If the command fails, verify that the file system was created. In this case it is likely that one of the following has occurred:
4. Find the row containing the server's name in the first column.
5. From the client, use the ping(1M) command on each entry from the second column of samsharefs(1M) output to verify that the server can be reached.
Use this command in the following format:
For servername, specify the name of the server as shown in the second column of the samsharefs(1M) command's output.
CODE EXAMPLE A-12 shows output from ping(1M).
6. If the ping(1M) command revealed unreachable hosts, examine the hosts.filesystem.local file from the client.
If there is more than one entry in the second column of samsharefs(1M) output, and if some of the entries are not reachable, ensure that only the reachable entries that you want the shared file system to use are present. Also ensure that the necessary entries are present in the /etc/opt/SUNWsamfs/hosts.filesystem.local file on that host, and that unreachable hosts are not entered in either place.
If the sam-sharefsd daemon attempts to connect to unreachable server interfaces, there can be substantial delays in its connecting to the server after installation, rebooting, or file system host reconfiguration. This affects metadata server failover operations substantially.
CODE EXAMPLE A-13 shows the hosts.sharefs1.local file.
dione-client# cat /etc/opt/SUNWsamfs/hosts.sharefs1.local
titan    titan    # no route to 173.26.2.129
tethys   tethys   # no route to 173.26.2.130
7. If the ping(1M) command revealed that there were no reachable server interfaces, enable the correct server interfaces.
Either configure or initialize the server network interfaces for typical operations, or use the samsharefs(1M) command to update the interface names in the hosts file so they match the actual names.
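The reachability check in this procedure works through every interface in the second column of the samsharefs(1M) output for the server. The following sketch extracts those interfaces from saved output so that each can be passed to ping(1M). The function name host_interfaces is hypothetical, and the sketch assumes that multiple interfaces in the second column are separated by commas. On Solaris hosts, nawk(1) might be needed in place of awk.

```shell
#!/bin/sh
# host_interfaces HOST
# Read shared hosts file text on standard input and print each interface
# listed in column 2 of the row for HOST, one per line.
# Assumption: multiple interfaces in column 2 are comma-separated.
host_interfaces() {
    awk -v host="$1" '
        $1 ~ /^#/ { next }                      # skip comment lines
        $1 == host {
            n = split($2, ifs, ",")
            for (i = 1; i <= n; i++) print ifs[i]
        }
    '
}
```

Each name or address printed for the server row can then be given to ping(1M) from the client. Interfaces that do not respond are candidates for removal from the hosts.filesystem.local file as described in Step 6.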
Perform these steps if the procedure in To Verify Network Connections did not show an ESTABLISHED connection.
1. Obtain samsharefs(1M) output.
This can be the output generated in To Verify That the Client Can Reach the Server, or you can generate it again using the initial steps in that procedure.
2. Find the row containing the client's name in the first column.
3. On the client, run the hostname(1M) command and ensure that the output matches the name in the first column of samsharefs(1M) output.
CODE EXAMPLE A-14 shows the hostname(1M) command and its output.
4. If the hostname(1M) command output matched the name in the first column of samsharefs(1M) output, use the ping(1M) command on the server to verify that the client can be reached.
CODE EXAMPLE A-15 shows the ping(1M) command and its output.
It is not necessary that every entry in column two of CODE EXAMPLE A-13 be reachable, but all interfaces that you wish any potential server to accept connections from must be present in the column. The server rejects connections from interfaces that are not declared in the shared hosts file.
5. If the ping(1M) command revealed that there were no reachable client interfaces, enable the correct client interfaces.
Either configure or initialize the client network interfaces for typical operations, or use the samsharefs(1M) command to update the interface names in the hosts file so they match the actual names.
The trace log files keep information generated by the sam-sharefsd(1M) daemons during their operation. The trace log files include information about connections attempted, received, denied, refused, and so on, as well as other operations such as host file changes and metadata server changes.
Tracking problems in log files often involves reconciling the order of operations on different hosts by using the log files. If the hosts' clocks are synchronized, log file interpretation is greatly simplified. One of the installation steps directs you to enable the network time daemon, xntpd(1M). This synchronizes the clocks of the metadata server and all client hosts during Sun StorageTek QFS shared file system operations.
The trace logs are particularly useful when setting up an initial configuration. The client logs show outgoing connection attempts. The corresponding messages in the server log files are some of the most useful tools for diagnosing network and configuration problems with the Sun StorageTek QFS shared file system. The log files contain diagnostic information for resolving most common problems.
The following procedures can resolve most mount(1M) problems:
If none of the preceding procedures resolve the problem, perform the steps in this section. You can perform these steps on both the server and the client hosts.
1. Verify the presence of file /var/opt/SUNWsamfs/trace/sam-sharefsd.
If this file is not present, or if it shows no recent modifications, proceed to the next step.
If the file is present, use tail(1) or another command to examine the last few lines in the file. If it shows suspicious conditions, use one or more of the other procedures in this section to investigate the problem.
2. If Step 1 indicates that file /var/opt/SUNWsamfs/trace/sam-sharefsd does not exist or if the file shows no recent modifications, edit file /etc/opt/SUNWsamfs/defaults.conf and add lines to enable sam-sharefsd tracing.
a. If a defaults.conf file does not already reside in /etc/opt/SUNWsamfs, copy the example defaults.conf file from /opt/SUNWsamfs/examples/defaults.conf to /etc/opt/SUNWsamfs:
b. Use vi(1) or another editor to edit file /etc/opt/SUNWsamfs/defaults.conf and add lines to enable tracing.
CODE EXAMPLE A-16 shows the lines to add to the defaults.conf file.
trace
sam-sharefsd = on
sam-sharefsd.options = all
endtrace
c. Issue the samd(1M) config command to reconfigure the sam-fsd(1M) daemon and cause it to recognize the new defaults.conf file.
d. Issue the sam-fsd(1M) command to check the configuration files.
CODE EXAMPLE A-17 shows the output from the sam-fsd(1M) command.
e. Examine the log file in /var/opt/SUNWsamfs/trace/sam-sharefsd to check for errors:
3. Examine the last few dozen lines of the trace file for diagnostic information.
CODE EXAMPLE A-18 shows a typical sam-sharefsd client log file. In this example, the server is titan, and the client is dione. This file contains normal log entries generated after a package installation, and it finishes with the daemon operating normally on a mounted file system.
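Steps 1 and 2 of this procedure amount to two questions: has the trace file been written recently, and does defaults.conf enable sam-sharefsd tracing? The following sketch answers both. The function names trace_is_fresh and tracing_enabled are hypothetical, "recent" is taken here to mean modified within the last 24 hours, and the parser assumes the spacing shown in CODE EXAMPLE A-16. On Solaris hosts, nawk(1) might be needed in place of awk.

```shell
#!/bin/sh
# trace_is_fresh FILE
# Succeed if FILE exists and was modified within the last 24 hours.
trace_is_fresh() {
    [ -f "$1" ] && [ -n "$(find "$1" -mtime -1 2>/dev/null)" ]
}

# tracing_enabled FILE
# Succeed if FILE contains "sam-sharefsd = on" between a "trace" line
# and an "endtrace" line.
# Assumption: spaces surround the "=" sign, as in CODE EXAMPLE A-16.
tracing_enabled() {
    awk '
        $1 == "trace" && NF == 1 { in_trace = 1; next }
        $1 == "endtrace"         { in_trace = 0; next }
        in_trace && $1 == "sam-sharefsd" && $2 == "=" && $3 == "on" { found = 1 }
        END { exit(found ? 0 : 1) }
    ' "$1"
}
```

For example, trace_is_fresh /var/opt/SUNWsamfs/trace/sam-sharefsd followed by tracing_enabled /etc/opt/SUNWsamfs/defaults.conf indicates whether Step 2 is needed before you examine the log.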
Linux clients and Solaris clients use different procedures to locate system information and diagnose Sun StorageTek QFS issues.
Files that contain system information from the Linux kernel are in the /proc file system. For example, the /proc/cpuinfo file contains hardware information. TABLE A-2 describes some files that contain useful troubleshooting information.
Linux kernel log messages go to the /var/log/messages file.
Because the Linux kernel has many variations, troubleshooting problems can be very challenging. A few tools are available that might help in debugging:
Note - These projects are not present by default in Red Hat Linux or SuSE. You must obtain the appropriate RPMs or SRPMs and might have to reconfigure the kernel to use them.
Note - Trace files are placed in the /var/opt/SUNWsamfs/trace directory on the Linux client, just as they are on the Solaris client.
The following questions about the Linux client are frequently asked by users who are familiar with Sun StorageTek QFS on the Solaris platform.
Q: The Linux installation script reports that I got a negative score and cannot install the software. Is there any way I can still install the software?
A: You can try the -force-custom and -force-build installation options. However, this may cause a system panic when installing the modules. This is especially a risk if your kernel is built with some of the kernel hacking options enabled, such as spinlock debugging.
Q: Can I use commands such as vmstat, iostat, top, and truss on Linux?
A: The vmstat, top, and iostat commands are found in many Linux installations. If they are not installed, they can be added using the sysstat and procps RPMs. The Linux equivalents of truss are ltrace and strace.
Q: Can Sun StorageTek Traffic Manager be used with the Sun StorageTek QFS Linux client?
A: Yes. First build a custom kernel with multipathing support as described in the Sun StorageTek Traffic Manager documentation. Then install the Linux client software.
Q: Can Extensible Firmware Interface (EFI) labels be used on the Sun StorageTek QFS Linux client?
A: Most Linux kernels are not built with support for EFI labels with GPT (GUID Partition Table) partitions. Therefore, to use EFI labels, you must rebuild the kernel with the CONFIG_EFI_PARTITION option set. For more information about building a custom kernel, see the distribution documentation.
Q: Can I use other Linux volume managers such as logical volume management (LVM), Enterprise Volume Management System (EVMS), or Device Mapper with the Sun StorageTek QFS Linux client software?
A: To use a file system with EVMS, you need to have a File System Interface Module (FSIM) for that file system. No FSIM exists for the Sun StorageTek QFS product. For you to use LVM, the partition type that fdisk shows must be LVM(8e). Partitions that Sun StorageTek QFS uses must be SunOS.
Q: Can I use file systems that are larger than two terabytes?
A: Yes, but some utilities that provide file system information, such as df, might return incorrect information when run on Linux. In addition, there may be problems when sharing the file system with NFS or Samba.
Q: Are there any differences between the mount options supported on the Linux client and those supported on the Solaris client?
A: There are many samfs mount options that are not supported on the Linux client. Two to be aware of are nosuid and forcedirectio. See the Sun StorageTek QFS Linux Client Guide for a complete list of supported mount options on the Linux client.
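For the EFI label question above, one way to see whether a rebuild is needed is to look for CONFIG_EFI_PARTITION in the running kernel's configuration file. The following is a sketch only: the function name kernel_option_set is hypothetical, and the default path /boot/config-<release> is an assumption that holds on many, but not all, Linux distributions.

```shell
#!/bin/sh
# kernel_option_set OPTION [FILE]
# Succeed if the kernel configuration FILE sets OPTION to y or m.
# Assumption: the default FILE, /boot/config-<release>, exists on this
# distribution; pass the path explicitly if it does not.
kernel_option_set() {
    grep -Eq "^$1=(y|m)\$" "${2:-/boot/config-$(uname -r)}"
}
```

For example, if kernel_option_set CONFIG_EFI_PARTITION fails, the kernel was probably built without support for EFI labels and would need to be rebuilt as described in that answer.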
Copyright © 2007, Sun Microsystems, Inc. All Rights Reserved.