RLEC rsyslog logging documentation¶

Version: Feb 11, 2016

General¶

This documents explains the structure of RLEC log entries that go into rsyslog and how to use these log entries to identify events.

Logging concepts¶

RLEC adds to the log different log entries, from various cluster components, for various events and actions that take place in the cluster. There might be many different log entries for a specific event that can be perceived externally as a single event. For example, in order for the cluster to decide that a cluster node is down there might be various log entries added by different cluster components from various nodes with different descriptions, until the cluster gets to a final decision that the node is actually down. In other cases, similar entries might be added to the log and the cluster will eventually get to a decision that the node is actually not down. In addition, some actions that might seem to the user as an atomic action, like removing a node from the cluster, are actually made up of several different events that take place in a sequence, and might also fail in the process. As a result, Redis Labs enabled that by default all logs entries that are shown in the log page in the management UI will also be written to syslog. Then rsyslog can be configured to monitor the syslog. All alerts are logged to syslog if the alerts are configured to be enabled, in addition to other log entries. The log entries can be categorized into events and alerts . Events just get logged, while alerts have a state attached to them. RLEC log entries include information about the specific event that occurred as detailed below in the Log entry structure section. In addition, rsyslog can be configured to add other information, like the event severity for example. Since rsyslog entries do not include the severity information by default, you can use the following instructions in order to log that information (in Ubuntu):

Add the following line to /etc/rsyslog.conf:

$template TraditionalFormatWithPRI,"%pri-text%: %timegenerated% %HOSTNAME% %syslogtag%%msg:::drop-last-lf%\n"
modify $ActionFileDefaultTemplate to use your new template $ActionFileDefaultTemplate TraditionalFormatWithPRI

Make sure to save the changes and restart rsyslog in order for the changes to take effect. you can see the alerts & events under /var/log in messages log file.

Command components¶

%pritext% - adds the severity
%timegenerated% - adds the timestamp
%HOSTNAME% - adds the machine name
%syslogtag% - the RLEC message as detailed below in the Log entry structure section below.
%msg:::droplastlf%n removes duplicated log entries

Log entry structure¶

The log entries have the following basic structure:

event_log[<process id>]:{<list of key value pairs in any order>}

event_log - plain static text that will always show at the beginning of the entry.
process id - the id of the process the logging in running under.
list of key value pairs in any order a list of key value pairs describing the specific event. The keyvalues pairs can appear in any order. Some keyvalue pairs will always appear, and some appear depending on the specific event.
- Keyvalue pairs that will always appear:
  
  “type” - A unique codename identifying the event logged. For the list of codenames see List of Logged Events and Alerts.
  
  “object” - has the format of “<object type>[:<id>]”. Defines the object type, and id if relevant, of the object this event is related to. For example cluster, node with id, bdb with id, etc’.
  
  “time” - unix time, can be ignored in this context.
- Keyvalue pairs that might appear depending on the specific entry:
  
  “state” - boolean with value true or false. This is relevant only for entries from category alert. True means that the alert is on. False means that the alert is off.
  
  “global_threshold” - a value of a threshold for alerts related to the “cluster” or “node” objects.
  
  “threshold” - a value of a threshold for alerts related to the “bdb” object.

Log entry samples¶

Below are examples for log entries that include the rsyslog configuration mentioned above that add the severity, timestamp and machine name.

Ephemeral storage passed threshold¶

“Alert on” log entry sample¶

Example:

daemon.warning: Jun 14 14:49:20 node1 event_log[3464]:{"storage_util": 90.061643120001, "global_threshold": "70", "object": "node:1", "state": true, "time": 1434282560, "type": "ephemeral_storage"}

The log entry above is an example of when the alert for node with id 1 “Ephemeral storage has reached 70% of its capacity” has been raised as result of storage utilization reaching the value of ~90%.

Log entry components:
- daemon.warning: severity of entry is warning
- Jun 14 14:49:20: the timestamp of the event
- node1: machine name
- event_log: - static text that always appears
- [3464]: process id
- “storage_util”: 90.061643120001 current ephemeral storage utilization, in this case ~90%
- “global_threshold”: “70” the user configured threshold above which the alert is raised, in this case it is 70%
- “object”: “node:1” the object for which this alert has been raised for, in this case it is node with id 1
- “state”: true current state of the alert, in this case it is on
- “time”: 1434282560 can be ignored
- “type”: “ephemeral_storage” is the code name identifier of this specific event, see List of Logged Events and Alerts.

“Alert off” log entry sample¶

Example:

daemon.info: Jun 14 14:51:35 node1 event_log[3464]: {"storage_util":60.051723520008, "global_threshold": "70", "object": "node:1", "state":false, "time": 1434283480, "type": "ephemeral_storage"}

The log entry above is an example of when the alert for node with id 1 “Ephemeral storage has reached 70% of its capacity” has been turned off as result of storage utilization reaching the value of ~60%.

Log entry components:
- “daemon.info”: severity of entry is info
- “Jun 14 14:51:35”: the timestamp of the event
- “node1”: machine name
- “event_log”: static text that always appears
- “[3464]”: process id
- “storage_util”: 60.051723520008 current ephemeral storage utilization, in this case ~60%
- “global_threshold”: “70” the user configured threshold above which the alert is raised, in this case it is 70%
- “object”: “node:1” the object for which this alert has been raised for, in this case it is node with id 1
- “state”: false current state of the alert, in this case it is on
- “time”: 1434283480 can be ignored
- “type”: “ephemeral_storage” is the code name identifier of this specific event, see List of Logged Events and Alerts.

Odd number of nodes with a minimum of three nodes alert¶

“Alert on” log entry sample¶

Example:

daemon.warning: Jun 14 15:25:00 node1 event_log[8310]: {"object":"cluster", "state": true, "time": 1434284700, "node_count": 1, "type":"even_node_count"}

The log entry above is an example of when the alert for “True high availability requires an odd number of nodes with a minimum of three nodes” has been turned on as result of the cluster having only one node.

Log entry components:
- “daemon.warning”: severity of entry is warning
- “Jun 14 15:25:00”: the timestamp of the event
- “node1”: machine name
- “event_log”: static text that always appears
- “[8310]”: process id
- “object”: “cluster” the object for which this alert has been raised for, in this case it is the cluster
- “state”: true current state of the alert, in this case it is on
- “time”: 1434284700 can be ignored
- “node_count”: 1 the number of nodes in the cluster, in this case 1
- “type”: “even_node_count” is the code name identifier of this specific event, see see List of Logged Events and Alerts.

“Alert off” log entry sample¶

Example:

daemon.warning: Jun 14 15:30:40 node1 event_log[8310]: {"object":"cluster", "state": false, "time": 1434285200, "node_count": 3, "type":"even_node_count"}

The log entry above is an example of when the alert for “True high availability requires an odd number of nodes with a minimum of three nodes” has been turned off as result of the cluster having 3 nodes.

Log entry components:
- “daemon.info”: severity of entry is warning
- “Jun 14 15:30:40”: the timestamp of the event
- “node1”: machine name
- “event_log”: static text that always appears
- “[8310]”: process id
- “object”: “cluster” the object for which this alert has been raised for, in this case it is the cluster
- “state”: false current state of the alert, in this case it is off
- “time”: 1434285200 can be ignored
- “node_count”: 3 - the number of nodes in the cluster, in this case 3
- “type”: “even_node_count” is the code name identifier of this specific event, List of Logged Events and Alerts.

Node has insufficient disk space for AOF rewrite¶

“Alert on” log entry sample¶

Example:

daemon.err: Jun 15 13:51:23 node1 event_log[34252]: {"used": 23457188, "missing": 604602126, "object": "node:1", "free": 9867264, "needed":637926578, "state": true, "time": 1434365483, "disk": 705667072, "type":"insufficient_disk_aofrw"}

The log entry above is an example of when the alert for “Node has insufficient disk space for AOF rewrite” has been turned on as result of not having enough persistent storage disk space for AOF rewrite purposes. It is missing 604602126 bytes.

Log entry components:
- “daemon.err”: severity of entry is err
- “Jun 15 13:51:23”: the timestamp of the event
- “node1”: machine name
- “event_log”: static text that always appears
- “[34252]”: process id
- “used”: 23457188 the amount of disk space in bytes currently used for AOF files
- “missing”: 604602126 - the amount of disk space in bytes that is currently missing for AOF rewrite purposes
- “object”: “node:1” the object for which this alert has been raised for, in this case it is node with id 1
- “free”: 9867264 the amount of disk space in bytes that is currently free
- “needed”: 637926578 the amount of total disk space in bytes that is needed for AOF rewrite purposes
- “state”: true current state of the alert, in this case it is on
- “time”: 1434365483 can be ignored
- “disk”: 705667072 the total size in bytes of the persistent storage
- “type”: “insufficient_disk_aofrw” is the code name identifier of this specific event, List of Logged Events and Alerts.

“Alert off” log entry sample¶

Example:

daemon.info: Jun 15 13:51:11 node1 event_log[34252]: {"used": 0, "missing":-21614592, "object": "node:1", "free": 21614592, "needed": 0, "state":false, "time": 1434365471, "disk": 705667072, "type":"insufficient_disk_aofrw"}

Log entry components:
- “daemon.info”: severity of entry is info
- “Jun 15 13:51:11”: the timestamp of the event
- “node1”: machine name
- “event_log”: static text that always appears
- “[34252]”: process id
- “used”: 0 the amount of disk space in bytes currently used for AOF files
- “missing”: -21614592 - the amount of disk space in bytes that is currently missing for AOF rewrite purposes, in this case it is not missing because the number is negative
- “object”: “node:1” the object for which this alert has been raised for, in this case it is node with id 1
- “free”: 21614592 the amount of disk space in bytes that is currently free
- “needed”: 0 the amount of total disk space in bytes that is needed for AOF rewrite purposes, in this case no space is needed
- “state”: false current state of the alert, in this case it is off
- “time”: 1434365471 can be ignored
- “disk”: 705667072 the total size in bytes of the persistent storage
- “type”: “insufficient_disk_aofrw” is the code name identifier of this specific event, List of Logged Events and Alerts.

List of Logged Events and Alerts¶

Events and Alerts that may be written to syslog

UI Events¶

Logged events and alerts that appear in the UI.

Event code-name	Event as shown in the ui	Object type	Category	Severity	Notes
aof_slow_disk_io	Redis performance is degraded as result of disk I/O limits	node	alert	True: error, False: info
backup_delayed	Periodic backup has been delayed for longer than <threshold> minutes	bdb	alert	True: warning, False: info	Has threshold parameter in the data: section of the log entry.
cpu_utilization	CPU utilization has reached <threshold>%	node	alert	True: warning, False: info	Has global_threshold parameter in the key/value section of the log entry.
ephemeral_storage	Ephemeral storage has reached <threshold>% of its capacity	node	alert	True: warning, False: info	Has global_threshold parameter in the key/value section of the log entry.
even_node_count	True high availability requires an odd number of nodes with a minimum of three nodes	cluster	alert	True: warning, False: info
failed	Node failed	node	alert	critical
free_flash	Flash storage has reached <threshold>% of its capacity	node	alert	True: warning, False: info	Has global_threshold parameter in the key/value section of the log entry.
high_latency	Latency is higher than <threshold> msec	bdb	alert	True: warning, False: info	Has threshold parameter in the key/value section of the log entry.
high_syncer_lag	Replica of - sync lag is higher than <threshold> seconds	bdb	alert	True: warning, False: info	Has threshold parameter in the key/value section of the log entry.
high_throughput	Throughput is higher than <threshold> RPS (requests per second)	bdb	alert	True: warning, False: info	Has threshold parameter in the key/value section of the log entry.
inconsistent_redis_sw	Not all databases are running the same open source version	cluster	alert	True: warning, False: info
inconsistent_rl_sw	Not all nodes in the cluster are running the same Redis Labs Enterprise Cluster version	cluster	alert	True: warning, False: info
insufficient_disk_aofrw	Node has insufficient disk space for AOF rewrite	node	alert	True: error, False: info
internal_bdb	Issues with internal cluster databases	cluster	alert	True: warning, False: info
low_throughput	Throughput is lower than <threshold> RPS (requests per second)	bdb	alert	True: warning, False: info	Has threshold parameter in the key/value section of the log entry.
memory	Node memory has reached <threshold>% of its capacity	node	alert	True: warning, False: info	Has global_threshold parameter in the key/value section of the log entry.
multiple_nodes_down	Multiple cluster nodes are down - this might cause data loss	cluster	alert	True: warning, False: info
net_throughput	Network throughput has reached <threshold>MB/s	node	alert	True: warning, False: info	Has global_threshold parameter in the key/value section of the log entry.
node_joined	Node joined	cluster	event	info
node_remove_abort_completed	Node removed	cluster	event	info	The remove node is a process that can fail and can also be aborted. If aborted, the abort can succeed or fail.
node_remove_abort_failed	Node removed	cluster	event	error	The remove node is a process that can fail and can also be aborted. If aborted, the abort can succeed or fail.
node_remove_completed	Node removed	cluster	event	info	The remove node is a process that can fail and can also be aborted. If aborted, the abort can succeed or fail.
node_remove_failed	Node removed	cluster	event	error	The remove node is a process that can fail and can also be aborted. If aborted, the abort can succeed or fail.
persistent_storage	Persistent storage has reached <threshold>% of its capacity	node	alert	True: warning, False: info	Has global_threshold parameter in the key/value section of the log entry.
ram_dataset_overhead	RAM Dataset overhead in a shard has reached <threshold>% of its RAM limit	bdb	alert	True: warning, False: info	Has threshold parameter in the key/value section of the log entry.
ram_overcommit	Cluster capacity is less than total memory allocated to its databases	cluster	alert	True: error, False: info
ram_values	Percent of values in a shard’s RAM is lower than <threshold>% of its key count	bdb	alert	True: warning, False: info	Has threshold parameter in the key/value section of the log entry.
shard_num_ram_values	Number of values in a shard’s RAM is lower than <threshold> values	bdb	alert	True: warning, False: info	Has threshold parameter in the key/value section of the log entry.
size	Dataset size has reached <threshold>% of the memory limit	bdb	alert	True: warning, False: info	Has threshold parameter in the key/value section of the log entry.
syncer_connection_error	Replica of - database unable to sync with source	bdb	alert	error
syncer_general_error	Replica of - database unable to sync with source	bdb	alert	error
too_few_nodes_for_replication	Database replication requires at least two nodes in cluster	cluster	alert	True: warning, False: info

Non UI Events¶

Logged events and alerts that do not appear in the UI

Event code-name	Object type	Category	Severity	Notes
authentication_err	bdb	event	error	Replica of - error authenticating with the source database
backup_failed	bdb	event	error
backup_started	bdb	event	info
backup_succeeded	bdb	event	info
bdb_created	bdb	event	info
bdb_deleted	bdb	event	info
bdb_updated	bdb	event	info	indicates that a bdb configuration has been updated
checks_error	node	event	error	Indicates that one or more node checks have failed
cluster_updated	cluster	event	info	indicates that cluster settings have been updated
compression_unsup_err	bdb	event	error	Replica of - Compression not supported by sync destination
crossslot_err	bdb	event	error	Replica of - sharded destination does not support operation executed on source
export_failed	bdb	event	error
export_started	bdb	event	info
export_succeeded	bdb	event	info
import_failed	bdb	event	error
import_started	bdb	event	info
import_succeeded	bdb	event	info
license_added	cluster	event	info
license_deleted	cluster	event	info
license_updated	cluster	event	info
node_abort_remove_request	node	event	info
node_remove_request	node	event	info
oom_err	bdb	event	error	Replica of - Replication source/target out of memory
user_created	user	event	info
user_deleted	user	event	info
user_updated	user	event	info	indicates that a user configuration has been updated

Table Of Contents

Previous topic

Next topic

This Page

RLEC rsyslog logging documentation¶

General¶

Logging concepts¶

Command components¶

Log entry structure¶

Log entry samples¶

Ephemeral storage passed threshold¶

“Alert on” log entry sample¶

“Alert off” log entry sample¶

Odd number of nodes with a minimum of three nodes alert¶

“Alert on” log entry sample¶

“Alert off” log entry sample¶

Node has insufficient disk space for AOF rewrite¶

“Alert on” log entry sample¶

“Alert off” log entry sample¶

List of Logged Events and Alerts¶

UI Events¶

Non UI Events¶