RLEC rsyslog logging documentation¶
Version: Feb 11, 2016
General¶
This documents explains the structure of RLEC log entries that go into rsyslog and how to use these log entries to identify events.
Logging concepts¶
RLEC adds to the log different log entries, from various cluster components, for various events and actions that take place in the cluster. There might be many different log entries for a specific event that can be perceived externally as a single event. For example, in order for the cluster to decide that a cluster node is down there might be various log entries added by different cluster components from various nodes with different descriptions, until the cluster gets to a final decision that the node is actually down. In other cases, similar entries might be added to the log and the cluster will eventually get to a decision that the node is actually not down. In addition, some actions that might seem to the user as an atomic action, like removing a node from the cluster, are actually made up of several different events that take place in a sequence, and might also fail in the process. As a result, Redis Labs enabled that by default all logs entries that are shown in the log page in the management UI will also be written to syslog. Then rsyslog can be configured to monitor the syslog. All alerts are logged to syslog if the alerts are configured to be enabled, in addition to other log entries. The log entries can be categorized into events and alerts . Events just get logged, while alerts have a state attached to them. RLEC log entries include information about the specific event that occurred as detailed below in the Log entry structure section. In addition, rsyslog can be configured to add other information, like the event severity for example. Since rsyslog entries do not include the severity information by default, you can use the following instructions in order to log that information (in Ubuntu):
- Add the following line to /etc/rsyslog.conf:
$template TraditionalFormatWithPRI,"%pri-text%: %timegenerated% %HOSTNAME% %syslogtag%%msg:::drop-last-lf%\n"
- modify
$ActionFileDefaultTemplate
to use your new template$ActionFileDefaultTemplate TraditionalFormatWithPRI
Make sure to save the changes and restart rsyslog in order for the changes to take effect. you can see the alerts & events under /var/log in messages log file.
Command components¶
- %pritext% - adds the severity
- %timegenerated% - adds the timestamp
- %HOSTNAME% - adds the machine name
- %syslogtag% - the RLEC message as detailed below in the Log entry structure section below.
- %msg:::droplastlf%n removes duplicated log entries
Log entry structure¶
The log entries have the following basic structure:
event_log[<process id>]:{<list of key value pairs in any order>}
- event_log - plain static text that will always show at the beginning of the entry.
- process id - the id of the process the logging in running under.
- list of key value pairs in any order a list of key value pairs describing the specific event. The keyvalues pairs can appear in any order. Some keyvalue pairs will always appear, and some appear depending on the specific event.
- Keyvalue pairs that will always appear:
- “type” - A unique codename identifying the event logged. For the list of codenames see List of Logged Events and Alerts.
- “object” - has the format of “<object type>[:<id>]”. Defines the object type, and id if relevant, of the object this event is related to. For example cluster, node with id, bdb with id, etc’.
- “time” - unix time, can be ignored in this context.
- Keyvalue pairs that might appear depending on the specific entry:
- “state” - boolean with value true or false. This is relevant only for entries from category alert. True means that the alert is on. False means that the alert is off.
- “global_threshold” - a value of a threshold for alerts related to the “cluster” or “node” objects.
- “threshold” - a value of a threshold for alerts related to the “bdb” object.
Log entry samples¶
Below are examples for log entries that include the rsyslog configuration mentioned above that add the severity, timestamp and machine name.
Ephemeral storage passed threshold¶
“Alert on” log entry sample¶
Example:
daemon.warning: Jun 14 14:49:20 node1 event_log[3464]:{"storage_util": 90.061643120001, "global_threshold": "70", "object": "node:1", "state": true, "time": 1434282560, "type": "ephemeral_storage"}
The log entry above is an example of when the alert for node with id 1 “Ephemeral storage has reached 70% of its capacity” has been raised as result of storage utilization reaching the value of ~90%.
- Log entry components:
- daemon.warning: severity of entry is warning
- Jun 14 14:49:20: the timestamp of the event
- node1: machine name
- event_log: - static text that always appears
- [3464]: process id
- “storage_util”: 90.061643120001 current ephemeral storage utilization, in this case ~90%
- “global_threshold”: “70” the user configured threshold above which the alert is raised, in this case it is 70%
- “object”: “node:1” the object for which this alert has been raised for, in this case it is node with id 1
- “state”: true current state of the alert, in this case it is on
- “time”: 1434282560 can be ignored
- “type”: “ephemeral_storage” is the code name identifier of this specific event, see List of Logged Events and Alerts.
“Alert off” log entry sample¶
Example:
daemon.info: Jun 14 14:51:35 node1 event_log[3464]: {"storage_util":60.051723520008, "global_threshold": "70", "object": "node:1", "state":false, "time": 1434283480, "type": "ephemeral_storage"}
The log entry above is an example of when the alert for node with id 1 “Ephemeral storage has reached 70% of its capacity” has been turned off as result of storage utilization reaching the value of ~60%.
- Log entry components:
- “daemon.info”: severity of entry is info
- “Jun 14 14:51:35”: the timestamp of the event
- “node1”: machine name
- “event_log”: static text that always appears
- “[3464]”: process id
- “storage_util”: 60.051723520008 current ephemeral storage utilization, in this case ~60%
- “global_threshold”: “70” the user configured threshold above which the alert is raised, in this case it is 70%
- “object”: “node:1” the object for which this alert has been raised for, in this case it is node with id 1
- “state”: false current state of the alert, in this case it is on
- “time”: 1434283480 can be ignored
- “type”: “ephemeral_storage” is the code name identifier of this specific event, see List of Logged Events and Alerts.
Odd number of nodes with a minimum of three nodes alert¶
“Alert on” log entry sample¶
Example:
daemon.warning: Jun 14 15:25:00 node1 event_log[8310]: {"object":"cluster", "state": true, "time": 1434284700, "node_count": 1, "type":"even_node_count"}
The log entry above is an example of when the alert for “True high availability requires an odd number of nodes with a minimum of three nodes” has been turned on as result of the cluster having only one node.
- Log entry components:
- “daemon.warning”: severity of entry is warning
- “Jun 14 15:25:00”: the timestamp of the event
- “node1”: machine name
- “event_log”: static text that always appears
- “[8310]”: process id
- “object”: “cluster” the object for which this alert has been raised for, in this case it is the cluster
- “state”: true current state of the alert, in this case it is on
- “time”: 1434284700 can be ignored
- “node_count”: 1 the number of nodes in the cluster, in this case 1
- “type”: “even_node_count” is the code name identifier of this specific event, see see List of Logged Events and Alerts.
“Alert off” log entry sample¶
Example:
daemon.warning: Jun 14 15:30:40 node1 event_log[8310]: {"object":"cluster", "state": false, "time": 1434285200, "node_count": 3, "type":"even_node_count"}
The log entry above is an example of when the alert for “True high availability requires an odd number of nodes with a minimum of three nodes” has been turned off as result of the cluster having 3 nodes.
- Log entry components:
- “daemon.info”: severity of entry is warning
- “Jun 14 15:30:40”: the timestamp of the event
- “node1”: machine name
- “event_log”: static text that always appears
- “[8310]”: process id
- “object”: “cluster” the object for which this alert has been raised for, in this case it is the cluster
- “state”: false current state of the alert, in this case it is off
- “time”: 1434285200 can be ignored
- “node_count”: 3 - the number of nodes in the cluster, in this case 3
- “type”: “even_node_count” is the code name identifier of this specific event, List of Logged Events and Alerts.
Node has insufficient disk space for AOF rewrite¶
“Alert on” log entry sample¶
Example:
daemon.err: Jun 15 13:51:23 node1 event_log[34252]: {"used": 23457188, "missing": 604602126, "object": "node:1", "free": 9867264, "needed":637926578, "state": true, "time": 1434365483, "disk": 705667072, "type":"insufficient_disk_aofrw"}
The log entry above is an example of when the alert for “Node has insufficient disk space for AOF rewrite” has been turned on as result of not having enough persistent storage disk space for AOF rewrite purposes. It is missing 604602126 bytes.
- Log entry components:
- “daemon.err”: severity of entry is err
- “Jun 15 13:51:23”: the timestamp of the event
- “node1”: machine name
- “event_log”: static text that always appears
- “[34252]”: process id
- “used”: 23457188 the amount of disk space in bytes currently used for AOF files
- “missing”: 604602126 - the amount of disk space in bytes that is currently missing for AOF rewrite purposes
- “object”: “node:1” the object for which this alert has been raised for, in this case it is node with id 1
- “free”: 9867264 the amount of disk space in bytes that is currently free
- “needed”: 637926578 the amount of total disk space in bytes that is needed for AOF rewrite purposes
- “state”: true current state of the alert, in this case it is on
- “time”: 1434365483 can be ignored
- “disk”: 705667072 the total size in bytes of the persistent storage
- “type”: “insufficient_disk_aofrw” is the code name identifier of this specific event, List of Logged Events and Alerts.
“Alert off” log entry sample¶
Example:
daemon.info: Jun 15 13:51:11 node1 event_log[34252]: {"used": 0, "missing":-21614592, "object": "node:1", "free": 21614592, "needed": 0, "state":false, "time": 1434365471, "disk": 705667072, "type":"insufficient_disk_aofrw"}
- Log entry components:
- “daemon.info”: severity of entry is info
- “Jun 15 13:51:11”: the timestamp of the event
- “node1”: machine name
- “event_log”: static text that always appears
- “[34252]”: process id
- “used”: 0 the amount of disk space in bytes currently used for AOF files
- “missing”: -21614592 - the amount of disk space in bytes that is currently missing for AOF rewrite purposes, in this case it is not missing because the number is negative
- “object”: “node:1” the object for which this alert has been raised for, in this case it is node with id 1
- “free”: 21614592 the amount of disk space in bytes that is currently free
- “needed”: 0 the amount of total disk space in bytes that is needed for AOF rewrite purposes, in this case no space is needed
- “state”: false current state of the alert, in this case it is off
- “time”: 1434365471 can be ignored
- “disk”: 705667072 the total size in bytes of the persistent storage
- “type”: “insufficient_disk_aofrw” is the code name identifier of this specific event, List of Logged Events and Alerts.
List of Logged Events and Alerts¶
Events and Alerts that may be written to syslog
UI Events¶
Logged events and alerts that appear in the UI.
Event code-name | Event as shown in the ui | Object type | Category | Severity | Notes |
---|---|---|---|---|---|
aof_slow_disk_io | Redis performance is degraded as result of disk I/O limits | node | alert | True: error, False: info | |
backup_delayed | Periodic backup has been delayed for longer than <threshold> minutes | bdb | alert | True: warning, False: info | Has threshold parameter in the data: section of the log entry. |
cpu_utilization | CPU utilization has reached <threshold>% | node | alert | True: warning, False: info | Has global_threshold parameter in the key/value section of the log entry. |
ephemeral_storage | Ephemeral storage has reached <threshold>% of its capacity | node | alert | True: warning, False: info | Has global_threshold parameter in the key/value section of the log entry. |
even_node_count | True high availability requires an odd number of nodes with a minimum of three nodes | cluster | alert | True: warning, False: info | |
failed | Node failed | node | alert | critical | |
free_flash | Flash storage has reached <threshold>% of its capacity | node | alert | True: warning, False: info | Has global_threshold parameter in the key/value section of the log entry. |
high_latency | Latency is higher than <threshold> msec | bdb | alert | True: warning, False: info | Has threshold parameter in the key/value section of the log entry. |
high_syncer_lag | Replica of - sync lag is higher than <threshold> seconds | bdb | alert | True: warning, False: info | Has threshold parameter in the key/value section of the log entry. |
high_throughput | Throughput is higher than <threshold> RPS (requests per second) | bdb | alert | True: warning, False: info | Has threshold parameter in the key/value section of the log entry. |
inconsistent_redis_sw | Not all databases are running the same open source version | cluster | alert | True: warning, False: info | |
inconsistent_rl_sw | Not all nodes in the cluster are running the same Redis Labs Enterprise Cluster version | cluster | alert | True: warning, False: info | |
insufficient_disk_aofrw | Node has insufficient disk space for AOF rewrite | node | alert | True: error, False: info | |
internal_bdb | Issues with internal cluster databases | cluster | alert | True: warning, False: info | |
low_throughput | Throughput is lower than <threshold> RPS (requests per second) | bdb | alert | True: warning, False: info | Has threshold parameter in the key/value section of the log entry. |
memory | Node memory has reached <threshold>% of its capacity | node | alert | True: warning, False: info | Has global_threshold parameter in the key/value section of the log entry. |
multiple_nodes_down | Multiple cluster nodes are down - this might cause data loss | cluster | alert | True: warning, False: info | |
net_throughput | Network throughput has reached <threshold>MB/s | node | alert | True: warning, False: info | Has global_threshold parameter in the key/value section of the log entry. |
node_joined | Node joined | cluster | event | info | |
node_remove_abort_completed | Node removed | cluster | event | info | The remove node is a process that can fail and can also be aborted. If aborted, the abort can succeed or fail. |
node_remove_abort_failed | Node removed | cluster | event | error | The remove node is a process that can fail and can also be aborted. If aborted, the abort can succeed or fail. |
node_remove_completed | Node removed | cluster | event | info | The remove node is a process that can fail and can also be aborted. If aborted, the abort can succeed or fail. |
node_remove_failed | Node removed | cluster | event | error | The remove node is a process that can fail and can also be aborted. If aborted, the abort can succeed or fail. |
persistent_storage | Persistent storage has reached <threshold>% of its capacity | node | alert | True: warning, False: info | Has global_threshold parameter in the key/value section of the log entry. |
ram_dataset_overhead | RAM Dataset overhead in a shard has reached <threshold>% of its RAM limit | bdb | alert | True: warning, False: info | Has threshold parameter in the key/value section of the log entry. |
ram_overcommit | Cluster capacity is less than total memory allocated to its databases | cluster | alert | True: error, False: info | |
ram_values | Percent of values in a shard’s RAM is lower than <threshold>% of its key count | bdb | alert | True: warning, False: info | Has threshold parameter in the key/value section of the log entry. |
shard_num_ram_values | Number of values in a shard’s RAM is lower than <threshold> values | bdb | alert | True: warning, False: info | Has threshold parameter in the key/value section of the log entry. |
size | Dataset size has reached <threshold>% of the memory limit | bdb | alert | True: warning, False: info | Has threshold parameter in the key/value section of the log entry. |
syncer_connection_error | Replica of - database unable to sync with source | bdb | alert | error | |
syncer_general_error | Replica of - database unable to sync with source | bdb | alert | error | |
too_few_nodes_for_replication | Database replication requires at least two nodes in cluster | cluster | alert | True: warning, False: info |
Non UI Events¶
Logged events and alerts that do not appear in the UI
Event code-name | Object type | Category | Severity | Notes |
---|---|---|---|---|
authentication_err | bdb | event | error | Replica of - error authenticating with the source database |
backup_failed | bdb | event | error | |
backup_started | bdb | event | info | |
backup_succeeded | bdb | event | info | |
bdb_created | bdb | event | info | |
bdb_deleted | bdb | event | info | |
bdb_updated | bdb | event | info | indicates that a bdb configuration has been updated |
checks_error | node | event | error | Indicates that one or more node checks have failed |
cluster_updated | cluster | event | info | indicates that cluster settings have been updated |
compression_unsup_err | bdb | event | error | Replica of - Compression not supported by sync destination |
crossslot_err | bdb | event | error | Replica of - sharded destination does not support operation executed on source |
export_failed | bdb | event | error | |
export_started | bdb | event | info | |
export_succeeded | bdb | event | info | |
import_failed | bdb | event | error | |
import_started | bdb | event | info | |
import_succeeded | bdb | event | info | |
license_added | cluster | event | info | |
license_deleted | cluster | event | info | |
license_updated | cluster | event | info | |
node_abort_remove_request | node | event | info | |
node_remove_request | node | event | info | |
oom_err | bdb | event | error | Replica of - Replication source/target out of memory |
user_created | user | event | info | |
user_deleted | user | event | info | |
user_updated | user | event | info | indicates that a user configuration has been updated |