Disk Usage and Drain of Events Health Monitor Alerts

The Disk Usage health module compares disk usage on a managed device’s hard drive and malware storage pack to the limits configured for the module and alerts when usage exceeds those limits. This module also alerts when the system excessively deletes files in monitored disk usage categories, or when disk usage excluding those categories reaches excessive levels, based on module thresholds.

This topic describes the symptoms and troubleshooting guidelines for two health alerts generated by the Disk Usage health module:

  • Frequent Drain of Events

  • Drain of Unprocessed Events

The disk manager process manages the disk usage of a device. Each type of file monitored by the disk manager is assigned a silo. Based on the amount of disk space available on the system, the disk manager computes a High Water Mark (HWM) and a Low Water Mark (LWM) for each silo.

To display detailed disk usage information for each part of the system, including silos, LWMs, and HWMs, use the show disk-manager command.

Examples

The following is an example of the output of the show disk-manager command.


> show disk-manager
Silo                                    Used        Minimum     Maximum
Temporary Files                         0 KB        499.197 MB  1.950 GB
Action Queue Results                    0 KB        499.197 MB  1.950 GB
User Identity Events                    0 KB        499.197 MB  1.950 GB
UI Caches                               4 KB        1.462 GB    2.925 GB
Backups                                 0 KB        3.900 GB    9.750 GB
Updates                                 0 KB        5.850 GB    14.625 GB
Other Detection Engine                  0 KB        2.925 GB    5.850 GB
Performance Statistics                  33 KB       998.395 MB  11.700 GB
Other Events                            0 KB        1.950 GB    3.900 GB
IP Reputation & URL Filtering           0 KB        2.437 GB    4.875 GB
Archives & Cores & File Logs            0 KB        3.900 GB    19.500 GB
Unified Low Priority Events             1.329 MB    4.875 GB    24.375 GB
RNA Events                              0 KB        3.900 GB    15.600 GB
File Capture                            0 KB        9.750 GB    19.500 GB
Unified High Priority Events            0 KB        14.625 GB   34.125 GB
IPS Events                              0 KB        11.700 GB   29.250 GB
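
To compare silos programmatically, the output above can be parsed into numbers. The following is a minimal sketch, not a Cisco tool: the function names are hypothetical, it assumes the column layout shown in the example, and it assumes that KB, MB, and GB are binary units (1 GB = 1024 MB).

```python
import re

# Assumption: the CLI reports binary units (1 GB = 1024 MB).
UNIT_BYTES = {"KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

# A silo name, then three "<number> <unit>" columns. The name is separated
# from the first column by at least two spaces.
ROW = re.compile(
    r"^(?P<silo>.+?)\s{2,}"
    r"(?P<used>[\d.]+ [KMGT]B)\s+"
    r"(?P<minimum>[\d.]+ [KMGT]B)\s+"
    r"(?P<maximum>[\d.]+ [KMGT]B)\s*$"
)


def to_bytes(text):
    """Convert a string such as '4.875 GB' to a number of bytes."""
    value, unit = text.split()
    return float(value) * UNIT_BYTES[unit]


def parse_disk_manager(output):
    """Return {silo name: {'used', 'minimum', 'maximum'}} with values in bytes."""
    silos = {}
    for line in output.splitlines():
        match = ROW.match(line)
        if match:  # the header row and blank lines do not match and are skipped
            silos[match.group("silo")] = {
                key: to_bytes(match.group(key))
                for key in ("used", "minimum", "maximum")
            }
    return silos


sample = """\
Silo                                    Used        Minimum     Maximum
Unified Low Priority Events             1.329 MB    4.875 GB    24.375 GB
IPS Events                              0 KB        11.700 GB   29.250 GB
"""
silos = parse_disk_manager(sample)
```

Each entry in `silos` then holds the Used, Minimum, and Maximum values of one silo in bytes, which makes it straightforward to sort silos by usage or to compare them across devices.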

Health Alert Format

When the Health Monitor process on the CDO runs (once every 5 minutes or when a manual run is triggered), the Disk Usage module checks the diskmanager.log file and, if the required conditions are met, triggers the corresponding health alert.

The structures of these health alerts are as follows:

  • Frequent drain of <SILO NAME>

  • Drain of unprocessed events from <SILO NAME>

For example,

  • Frequent drain of Low Priority Events

  • Drain of unprocessed events from Low Priority Events
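
If alert text is collected for reporting, the two structures above can be separated with a small pattern match. This is an illustrative sketch only: `parse_alert` and the labels `frequent_drain` and `unprocessed_drain` are hypothetical names, and the pattern assumes the alert text matches the structures listed above exactly.

```python
import re

# The two alert structures described above, followed by the silo name.
ALERT = re.compile(
    r"^(?P<kind>Frequent drain of|Drain of unprocessed events from) (?P<silo>.+)$"
)


def parse_alert(text):
    """Return (kind, silo name), or None if the text is not a drain alert."""
    match = ALERT.match(text.strip())
    if match is None:
        return None
    if match.group("kind").startswith("Frequent"):
        kind = "frequent_drain"
    else:
        kind = "unprocessed_drain"
    return kind, match.group("silo")
```

For example, `parse_alert("Frequent drain of Low Priority Events")` yields the kind `frequent_drain` and the silo name `Low Priority Events`.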

It’s possible for any silo to generate a Frequent drain of <SILO NAME> health alert. However, the most commonly seen alerts are those related to events. Among the event silos, Low Priority Events alerts are seen most often because the device generates this type of event more frequently.

A Frequent drain of <SILO NAME> alert has a Warning severity level when seen in relation to an event-related silo, because events will be queued to be sent to the CDO. For a non-event related silo, such as the Backups silo, the alert has a Critical severity level because this information is lost.

Important

Only event silos generate a Drain of unprocessed events from <SILO NAME> health alert. This alert always has Critical severity level.
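
The severity rules above can be summarized as a small lookup. The sketch below is only a restatement of those rules: the function name and the `kind` labels are hypothetical, and the caller must decide whether a silo is event-related, because this topic does not list which silos count as event silos.

```python
def alert_severity(kind, is_event_silo):
    """Return the severity described above for a drain alert.

    kind is 'frequent_drain' or 'unprocessed_drain' (hypothetical labels);
    is_event_silo states whether the silo stores events.
    """
    if kind == "unprocessed_drain":
        if not is_event_silo:
            # Only event silos generate this alert.
            raise ValueError("unprocessed-event drains apply to event silos only")
        return "Critical"
    if kind == "frequent_drain":
        # Events remain queued for the CDO; data in other silos is lost.
        return "Warning" if is_event_silo else "Critical"
    raise ValueError(f"unknown alert kind: {kind!r}")
```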

Additional symptoms besides the alerts can include:

  • Slowness on the CDO user interface

  • Loss of events

Common Troubleshoot Scenarios

A Frequent drain of <SILO NAME> alert is caused by too much input into the silo for its size. In this case, the disk manager has drained (purged) that silo at least twice in the last 5-minute interval. In an event-type silo, this is typically caused by excessive logging of that event type.

A Drain of unprocessed events from <SILO NAME> health alert can also be caused by a bottleneck in the event processing path.
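
The "at least twice in the last 5-minute interval" condition can be expressed as a count over a time window. The following sketch is an illustration under assumptions, not the health module's actual logic: the function name is hypothetical, the window is taken to be the 5 minutes preceding the check, and the drain timestamps are assumed to have been extracted already (how to read them from diskmanager.log is not shown here).

```python
from datetime import datetime, timedelta


def is_frequent_drain(drain_times, now, window=timedelta(minutes=5), threshold=2):
    """Return True if at least `threshold` drains fall within `window` before `now`."""
    recent = [t for t in drain_times if now - window < t <= now]
    return len(recent) >= threshold


# Hypothetical drain timestamps for one silo.
now = datetime(2024, 1, 1, 12, 0, 0)
drains = [now - timedelta(minutes=4), now - timedelta(minutes=1)]
frequent = is_frequent_drain(drains, now)  # both drains fall inside the window
```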

There are three potential bottlenecks with respect to these Disk Usage alerts:

  • Excessive logging ― The EventHandler process on FTD is oversubscribed (it reads slower than what Snort writes).

  • Sftunnel bottleneck ― The Eventing interface is unstable or oversubscribed.

  • SFDataCorrelator bottleneck ― The data transmission channel between the CDO and the managed device is oversubscribed.

Excessive Logging

One of the most common causes of health alerts of this type is excessive input. The difference between the Low Water Mark (LWM) and the High Water Mark (HWM) reported by the show disk-manager command shows how much space a silo can take on as it grows from the LWM (freshly drained) to the HWM. If there are frequent drains of events (with or without unprocessed events), the first thing to review is the logging configuration.

  • Check for double logging ― Double logging scenarios can be identified if you look at the correlator perfstats on the CDO:

    admin@FMC:~$ sudo perfstats -Cq < /var/sf/rna/correlator-stats/now

  • Check logging settings for the ACP ― Review the logging settings of the Access Control Policy (ACP). If both the "Beginning" and the "End" of connections are logged, log only the end. An end-of-connection event includes everything that is logged at the beginning of the connection, and logging only the end reduces the number of events.

    Ensure that you follow the best practices described in Best Practices for Connection Logging.
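
The relationship between the LWM, the HWM, and the input rate can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the Minimum and Maximum columns of show disk-manager are assumed to correspond to the LWM and HWM, binary units are assumed, and the input rate of 200 MB per second is an invented figure, not a measured value.

```python
MIB = 1024**2
GIB = 1024**3


def headroom_bytes(lwm, hwm):
    """Space a freshly drained silo can absorb before it reaches the HWM."""
    return hwm - lwm


def seconds_to_fill(lwm, hwm, input_rate):
    """Time for a silo to grow from the LWM to the HWM at input_rate bytes/second."""
    if input_rate <= 0:
        return float("inf")  # no input, so the silo never fills
    return headroom_bytes(lwm, hwm) / input_rate


# Unified Low Priority Events from the example output: 4.875 GB and 24.375 GB.
lwm, hwm = 4.875 * GIB, 24.375 * GIB
headroom_gib = headroom_bytes(lwm, hwm) / GIB       # 19.5 GB of headroom
fill_time = seconds_to_fill(lwm, hwm, 200 * MIB)    # hypothetical 200 MB/s input
```

With these assumed numbers the silo fills in roughly 100 seconds, so it could be drained more than once within a 5-minute interval. Reducing the logging volume lengthens the fill time for the same headroom.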

Communications Bottleneck ― Sftunnel

Sftunnel is responsible for encrypted communications between the CDO and the managed device. Events are sent over the tunnel to the CDO. Connectivity issues and/or instability in the communication channel (sftunnel) between the managed device and the CDO can be due to:

  • Sftunnel is down or is unstable (flaps).

    Ensure that the CDO and the managed device have reachability between their management interfaces on TCP port 8305.

    The sftunnel process should be stable and should not restart unexpectedly. Verify this by checking the /var/log/messages file and searching for messages that contain the sftunneld string.

  • Sftunnel is oversubscribed.

    Review trend data from the Health Monitor and look for signs of oversubscription of the CDO's management interface, which can appear as a spike in management traffic or as constant oversubscription.

    Consider using a secondary management interface for Firepower-eventing. To use this interface, you must configure its IP address and other parameters at the FTD CLI using the configure network management-interface command.
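
As a first check of the reachability requirement above, a TCP connection to port 8305 can be attempted from a host on the management network. This is a minimal sketch using the Python standard library. The function name and the example address are hypothetical, and a successful TCP handshake only shows that the port is reachable, not that sftunnel is healthy.

```python
import socket


def can_reach(host, port=8305, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example call; 192.0.2.10 is a documentation placeholder address:
# can_reach("192.0.2.10")
```

A result of False points to a firewall rule, routing problem, or stopped listener between the two management interfaces and is worth resolving before investigating oversubscription.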

Communications Bottleneck ― SFDataCorrelator

The SFDataCorrelator manages data transmission between the CDO and the managed device; on the CDO, it analyzes binary files created by the system to generate events, connection data, and network maps. The first step is to review the diskmanager.log file for key information, such as:

  • The frequency of the drain.

  • The number of files with Unprocessed Events drained.

  • The occurrence of the drain with Unprocessed Events.

Each time the disk manager process runs, it writes an entry for each silo to its own log file, which is located at [/ngfw]/var/log/diskmanager.log. The information in diskmanager.log (in CSV format) can help narrow the search for a cause.

Additional troubleshooting steps:

  • The stats_unified.pl command can help you determine whether the managed device has data that still needs to be sent to the CDO. This condition can occur when the managed device and the CDO experience a connectivity issue, in which case the managed device stores the log data on its hard drive.

    admin@FMC:~$ sudo stats_unified.pl

  • The manage_procs.pl command can reconfigure the correlator on the CDO side.

    root@FMC:~# manage_procs.pl

Before You Contact Cisco Technical Assistance Center (TAC)

It is highly recommended to collect these items before you contact Cisco TAC:

  • Screenshots of the health alerts seen.

  • Troubleshoot file generated from the CDO.

  • Troubleshoot file generated from the affected managed device.

  • Date and time when the problem was first seen.

  • Information about any recent changes made to the policies (if applicable).

  • The output of the stats_unified.pl command as described in the Communications Bottleneck ― SFDataCorrelator section.