RLEC rsyslog logging documentation

Version: Feb 11, 2016

General

This documents explains the structure of RLEC log entries that go into rsyslog and how to use these log entries to identify events.

Logging concepts

RLEC adds to the log different log entries, from various cluster components, for various events and actions that take place in the cluster. There might be many different log entries for a specific event that can be perceived externally as a single event. For example, in order for the cluster to decide that a cluster node is down there might be various log entries added by different cluster components from various nodes with different descriptions, until the cluster gets to a final decision that the node is actually down. In other cases, similar entries might be added to the log and the cluster will eventually get to a decision that the node is actually not down. In addition, some actions that might seem to the user as an atomic action, like removing a node from the cluster, are actually made up of several different events that take place in a sequence, and might also fail in the process. As a result, Redis Labs enabled that by default all logs entries that are shown in the log page in the management UI will also be written to syslog. Then rsyslog can be configured to monitor the syslog. All alerts are logged to syslog if the alerts are configured to be enabled, in addition to other log entries. The log entries can be categorized into events and alerts . Events just get logged, while alerts have a state attached to them. RLEC log entries include information about the specific event that occurred as detailed below in the Log entry structure section. In addition, rsyslog can be configured to add other information, like the event severity for example. Since rsyslog entries do not include the severity information by default, you can use the following instructions in order to log that information (in Ubuntu):

  • Add the following line to /etc/rsyslog.conf:
    $template TraditionalFormatWithPRI,"%pri-text%: %timegenerated% %HOSTNAME% %syslogtag%%msg:::drop-last-lf%\n"
  • modify $ActionFileDefaultTemplate to use your new template $ActionFileDefaultTemplate TraditionalFormatWithPRI

Make sure to save the changes and restart rsyslog in order for the changes to take effect. you can see the alerts & events under /var/log in messages log file.

Command components

  • %pritext% - adds the severity
  • %timegenerated% - adds the timestamp
  • %HOSTNAME% - adds the machine name
  • %syslogtag% - the RLEC message as detailed below in the Log entry structure section below.
  • %msg:::droplastlf%n removes duplicated log entries

Log entry structure

The log entries have the following basic structure:

event_log[<process id>]:{<list of key value pairs in any order>}
  • event_log - plain static text that will always show at the beginning of the entry.
  • process id - the id of the process the logging in running under.
  • list of key value pairs in any order a list of key value pairs describing the specific event. The keyvalues pairs can appear in any order. Some keyvalue pairs will always appear, and some appear depending on the specific event.
    • Keyvalue pairs that will always appear:
      • “type” - A unique codename identifying the event logged. For the list of codenames see List of Logged Events and Alerts.
      • “object” - has the format of “<object type>[:<id>]”. Defines the object type, and id if relevant, of the object this event is related to. For example cluster, node with id, bdb with id, etc’.
      • “time” - unix time, can be ignored in this context.
    • Keyvalue pairs that might appear depending on the specific entry:
      • “state” - boolean with value true or false. This is relevant only for entries from category alert. True means that the alert is on. False means that the alert is off.
      • “global_threshold” - a value of a threshold for alerts related to the “cluster” or “node” objects.
      • “threshold” - a value of a threshold for alerts related to the “bdb” object.

Log entry samples

Below are examples for log entries that include the rsyslog configuration mentioned above that add the severity, timestamp and machine name.

Ephemeral storage passed threshold

“Alert on” log entry sample

Example:

daemon.warning: Jun 14 14:49:20 node1 event_log[3464]:{"storage_util": 90.061643120001, "global_threshold": "70", "object": "node:1", "state": true, "time": 1434282560, "type": "ephemeral_storage"}

The log entry above is an example of when the alert for node with id 1 “Ephemeral storage has reached 70% of its capacity” has been raised as result of storage utilization reaching the value of ~90%.

  • Log entry components:
    • daemon.warning: severity of entry is warning
    • Jun 14 14:49:20: the timestamp of the event
    • node1: machine name
    • event_log: - static text that always appears
    • [3464]: process id
    • “storage_util”: 90.061643120001 current ephemeral storage utilization, in this case ~90%
    • “global_threshold”: “70” the user configured threshold above which the alert is raised, in this case it is 70%
    • “object”: “node:1” the object for which this alert has been raised for, in this case it is node with id 1
    • “state”: true current state of the alert, in this case it is on
    • “time”: 1434282560 can be ignored
    • “type”: “ephemeral_storage” is the code name identifier of this specific event, see List of Logged Events and Alerts.

“Alert off” log entry sample

Example:

daemon.info: Jun 14 14:51:35 node1 event_log[3464]: {"storage_util":60.051723520008, "global_threshold": "70", "object": "node:1", "state":false, "time": 1434283480, "type": "ephemeral_storage"}

The log entry above is an example of when the alert for node with id 1 “Ephemeral storage has reached 70% of its capacity” has been turned off as result of storage utilization reaching the value of ~60%.

  • Log entry components:
    • “daemon.info”: severity of entry is info
    • “Jun 14 14:51:35”: the timestamp of the event
    • “node1”: machine name
    • “event_log”: static text that always appears
    • “[3464]”: process id
    • “storage_util”: 60.051723520008 current ephemeral storage utilization, in this case ~60%
    • “global_threshold”: “70” the user configured threshold above which the alert is raised, in this case it is 70%
    • “object”: “node:1” the object for which this alert has been raised for, in this case it is node with id 1
    • “state”: false current state of the alert, in this case it is on
    • “time”: 1434283480 can be ignored
    • “type”: “ephemeral_storage” is the code name identifier of this specific event, see List of Logged Events and Alerts.

Odd number of nodes with a minimum of three nodes alert

“Alert on” log entry sample

Example:

daemon.warning: Jun 14 15:25:00 node1 event_log[8310]: {"object":"cluster", "state": true, "time": 1434284700, "node_count": 1, "type":"even_node_count"}

The log entry above is an example of when the alert for “True high availability requires an odd number of nodes with a minimum of three nodes” has been turned on as result of the cluster having only one node.

  • Log entry components:
    • “daemon.warning”: severity of entry is warning
    • “Jun 14 15:25:00”: the timestamp of the event
    • “node1”: machine name
    • “event_log”: static text that always appears
    • “[8310]”: process id
    • “object”: “cluster” the object for which this alert has been raised for, in this case it is the cluster
    • “state”: true current state of the alert, in this case it is on
    • “time”: 1434284700 can be ignored
    • “node_count”: 1 the number of nodes in the cluster, in this case 1
    • “type”: “even_node_count” is the code name identifier of this specific event, see see List of Logged Events and Alerts.

“Alert off” log entry sample

Example:

daemon.warning: Jun 14 15:30:40 node1 event_log[8310]: {"object":"cluster", "state": false, "time": 1434285200, "node_count": 3, "type":"even_node_count"}

The log entry above is an example of when the alert for “True high availability requires an odd number of nodes with a minimum of three nodes” has been turned off as result of the cluster having 3 nodes.

  • Log entry components:
    • “daemon.info”: severity of entry is warning
    • “Jun 14 15:30:40”: the timestamp of the event
    • “node1”: machine name
    • “event_log”: static text that always appears
    • “[8310]”: process id
    • “object”: “cluster” the object for which this alert has been raised for, in this case it is the cluster
    • “state”: false current state of the alert, in this case it is off
    • “time”: 1434285200 can be ignored
    • “node_count”: 3 - the number of nodes in the cluster, in this case 3
    • “type”: “even_node_count” is the code name identifier of this specific event, List of Logged Events and Alerts.

Node has insufficient disk space for AOF rewrite

“Alert on” log entry sample

Example:

daemon.err: Jun 15 13:51:23 node1 event_log[34252]: {"used": 23457188, "missing": 604602126, "object": "node:1", "free": 9867264, "needed":637926578, "state": true, "time": 1434365483, "disk": 705667072, "type":"insufficient_disk_aofrw"}

The log entry above is an example of when the alert for “Node has insufficient disk space for AOF rewrite” has been turned on as result of not having enough persistent storage disk space for AOF rewrite purposes. It is missing 604602126 bytes.

  • Log entry components:
    • “daemon.err”: severity of entry is err
    • “Jun 15 13:51:23”: the timestamp of the event
    • “node1”: machine name
    • “event_log”: static text that always appears
    • “[34252]”: process id
    • “used”: 23457188 the amount of disk space in bytes currently used for AOF files
    • “missing”: 604602126 - the amount of disk space in bytes that is currently missing for AOF rewrite purposes
    • “object”: “node:1” the object for which this alert has been raised for, in this case it is node with id 1
    • “free”: 9867264 the amount of disk space in bytes that is currently free
    • “needed”: 637926578 the amount of total disk space in bytes that is needed for AOF rewrite purposes
    • “state”: true current state of the alert, in this case it is on
    • “time”: 1434365483 can be ignored
    • “disk”: 705667072 the total size in bytes of the persistent storage
    • “type”: “insufficient_disk_aofrw” is the code name identifier of this specific event, List of Logged Events and Alerts.

“Alert off” log entry sample

Example:

daemon.info: Jun 15 13:51:11 node1 event_log[34252]: {"used": 0, "missing":-21614592, "object": "node:1", "free": 21614592, "needed": 0, "state":false, "time": 1434365471, "disk": 705667072, "type":"insufficient_disk_aofrw"}
  • Log entry components:
    • “daemon.info”: severity of entry is info
    • “Jun 15 13:51:11”: the timestamp of the event
    • “node1”: machine name
    • “event_log”: static text that always appears
    • “[34252]”: process id
    • “used”: 0 the amount of disk space in bytes currently used for AOF files
    • “missing”: -21614592 - the amount of disk space in bytes that is currently missing for AOF rewrite purposes, in this case it is not missing because the number is negative
    • “object”: “node:1” the object for which this alert has been raised for, in this case it is node with id 1
    • “free”: 21614592 the amount of disk space in bytes that is currently free
    • “needed”: 0 the amount of total disk space in bytes that is needed for AOF rewrite purposes, in this case no space is needed
    • “state”: false current state of the alert, in this case it is off
    • “time”: 1434365471 can be ignored
    • “disk”: 705667072 the total size in bytes of the persistent storage
    • “type”: “insufficient_disk_aofrw” is the code name identifier of this specific event, List of Logged Events and Alerts.

List of Logged Events and Alerts

Events and Alerts that may be written to syslog

UI Events

Logged events and alerts that appear in the UI.

Event code-name Event as shown in the ui Object type Category Severity Notes
aof_slow_disk_io Redis performance is degraded as result of disk I/O limits node alert True: error, False: info  
backup_delayed Periodic backup has been delayed for longer than <threshold> minutes bdb alert True: warning, False: info Has threshold parameter in the data: section of the log entry.
cpu_utilization CPU utilization has reached <threshold>% node alert True: warning, False: info Has global_threshold parameter in the key/value section of the log entry.
ephemeral_storage Ephemeral storage has reached <threshold>% of its capacity node alert True: warning, False: info Has global_threshold parameter in the key/value section of the log entry.
even_node_count True high availability requires an odd number of nodes with a minimum of three nodes cluster alert True: warning, False: info  
failed Node failed node alert critical  
free_flash Flash storage has reached <threshold>% of its capacity node alert True: warning, False: info Has global_threshold parameter in the key/value section of the log entry.
high_latency Latency is higher than <threshold> msec bdb alert True: warning, False: info Has threshold parameter in the key/value section of the log entry.
high_syncer_lag Replica of - sync lag is higher than <threshold> seconds bdb alert True: warning, False: info Has threshold parameter in the key/value section of the log entry.
high_throughput Throughput is higher than <threshold> RPS (requests per second) bdb alert True: warning, False: info Has threshold parameter in the key/value section of the log entry.
inconsistent_redis_sw Not all databases are running the same open source version cluster alert True: warning, False: info  
inconsistent_rl_sw Not all nodes in the cluster are running the same Redis Labs Enterprise Cluster version cluster alert True: warning, False: info  
insufficient_disk_aofrw Node has insufficient disk space for AOF rewrite node alert True: error, False: info  
internal_bdb Issues with internal cluster databases cluster alert True: warning, False: info  
low_throughput Throughput is lower than <threshold> RPS (requests per second) bdb alert True: warning, False: info Has threshold parameter in the key/value section of the log entry.
memory Node memory has reached <threshold>% of its capacity node alert True: warning, False: info Has global_threshold parameter in the key/value section of the log entry.
multiple_nodes_down Multiple cluster nodes are down - this might cause data loss cluster alert True: warning, False: info  
net_throughput Network throughput has reached <threshold>MB/s node alert True: warning, False: info Has global_threshold parameter in the key/value section of the log entry.
node_joined Node joined cluster event info  
node_remove_abort_completed Node removed cluster event info The remove node is a process that can fail and can also be aborted. If aborted, the abort can succeed or fail.
node_remove_abort_failed Node removed cluster event error The remove node is a process that can fail and can also be aborted. If aborted, the abort can succeed or fail.
node_remove_completed Node removed cluster event info The remove node is a process that can fail and can also be aborted. If aborted, the abort can succeed or fail.
node_remove_failed Node removed cluster event error The remove node is a process that can fail and can also be aborted. If aborted, the abort can succeed or fail.
persistent_storage Persistent storage has reached <threshold>% of its capacity node alert True: warning, False: info Has global_threshold parameter in the key/value section of the log entry.
ram_dataset_overhead RAM Dataset overhead in a shard has reached <threshold>% of its RAM limit bdb alert True: warning, False: info Has threshold parameter in the key/value section of the log entry.
ram_overcommit Cluster capacity is less than total memory allocated to its databases cluster alert True: error, False: info  
ram_values Percent of values in a shard’s RAM is lower than <threshold>% of its key count bdb alert True: warning, False: info Has threshold parameter in the key/value section of the log entry.
shard_num_ram_values Number of values in a shard’s RAM is lower than <threshold> values bdb alert True: warning, False: info Has threshold parameter in the key/value section of the log entry.
size Dataset size has reached <threshold>% of the memory limit bdb alert True: warning, False: info Has threshold parameter in the key/value section of the log entry.
syncer_connection_error Replica of - database unable to sync with source bdb alert error  
syncer_general_error Replica of - database unable to sync with source bdb alert error  
too_few_nodes_for_replication Database replication requires at least two nodes in cluster cluster alert True: warning, False: info  

Non UI Events

Logged events and alerts that do not appear in the UI

Event code-name Object type Category Severity Notes
authentication_err bdb event error Replica of - error authenticating with the source database
backup_failed bdb event error  
backup_started bdb event info  
backup_succeeded bdb event info  
bdb_created bdb event info  
bdb_deleted bdb event info  
bdb_updated bdb event info indicates that a bdb configuration has been updated
checks_error node event error Indicates that one or more node checks have failed
cluster_updated cluster event info indicates that cluster settings have been updated
compression_unsup_err bdb event error Replica of - Compression not supported by sync destination
crossslot_err bdb event error Replica of - sharded destination does not support operation executed on source
export_failed bdb event error  
export_started bdb event info  
export_succeeded bdb event info  
import_failed bdb event error  
import_started bdb event info  
import_succeeded bdb event info  
license_added cluster event info  
license_deleted cluster event info  
license_updated cluster event info  
node_abort_remove_request node event info  
node_remove_request node event info  
oom_err bdb event error Replica of - Replication source/target out of memory
user_created user event info  
user_deleted user event info  
user_updated user event info indicates that a user configuration has been updated