RLEC rsyslog logging documentation
##################################

Version: Feb 11, 2016

General
=======
This documents explains the structure of RLEC log entries that go into rsyslog and how to use these log entries to identify events.

Logging concepts
================

RLEC adds to the log different log entries, from various cluster components, for various events and actions that take place in the cluster.
There might be many different log entries for a specific event that can be perceived externally as a single event.
For example, in order for the cluster to decide that a cluster node is down there might be various log entries added by different cluster components from various nodes with different descriptions, until the cluster gets to a final decision that the node is actually down. In other cases, similar entries might be added to the log and the cluster will eventually get to a decision that the node is actually not down.
In addition, some actions that might seem to the user as an atomic action, like removing a node from the cluster, are actually made up of several different events that take place in a sequence, and might also fail in the process.
As a result, Redis Labs enabled that by default all logs entries that are shown in the log page in the management UI will also be written to syslog. Then rsyslog can be configured to monitor the syslog. All alerts are logged to syslog if the alerts are configured to be enabled, in addition to other log entries.
The log entries can be categorized into events and alerts . Events just get logged, while alerts have a state attached to them.
RLEC log entries include information about the specific event that occurred as detailed below in the Log entry structure section. In addition, rsyslog can be configured to add other information, like the event severity for example.
Since rsyslog entries do not include the severity information by default, you can use the following instructions in order to log that information (in Ubuntu):

* Add the following line to **/etc/rsyslog.conf**:
    ``$template TraditionalFormatWithPRI,"%pri-text%: %timegenerated% %HOSTNAME% %syslogtag%%msg:::drop-last-lf%\n"``

* modify ``$ActionFileDefaultTemplate`` to use your new template ``$ActionFileDefaultTemplate TraditionalFormatWithPRI``

Make sure to save the changes and restart rsyslog in order for the changes to take effect.
you can see the alerts & events under /var/log in messages log file.

Command components
-------------------
* **%pritext%** - adds the severity
* **%timegenerated%** - adds the timestamp
* **%HOSTNAME%** - adds the machine name
* **%syslogtag%** - the RLEC message as detailed below in the Log entry structure section below.
* **%msg:::droplastlf%\n** removes duplicated log entries

Log entry structure
===================
The log entries have the following basic structure::

    event_log[<process id>]:{<list of key value pairs in any order>}

* event_log - plain static text that will always show at the beginning of the entry.
* process id - the id of the process the logging in running under.
* list of key value pairs in any order a list of key value pairs describing the specific event. The keyvalues pairs can appear in any order. Some keyvalue pairs will always appear, and some appear depending on the specific event.
	* Keyvalue pairs that will always appear:
		* "type" - A unique codename identifying the event logged. For the list of codenames see :ref:`Events-List`.
		* "object" - has the format of "<object type>[:<id>]". Defines the object type, and id if relevant, of the object this event is related to. For example cluster, node with id, bdb with id, etc’.
		* "time" - unix time, can be ignored in this context.
	* Keyvalue pairs that might appear depending on the specific entry:
		* "state" - boolean with value true or false. This is relevant only for entries from category alert. True means that the alert is on. False means that the alert is off.
		* "global_threshold" - a value of a threshold for alerts related to the "cluster" or "node" objects.
		* "threshold" - a value of a threshold for alerts related to the "bdb" object.

Log entry samples
=================
Below are examples for log entries that include the rsyslog configuration mentioned above that add the severity, timestamp and machine name.

Ephemeral storage passed threshold
----------------------------------

"Alert on" log entry sample
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Example::

    daemon.warning: Jun 14 14:49:20 node1 event_log[3464]:{"storage_util": 90.061643120001, "global_threshold": "70", "object": "node:1", "state": true, "time": 1434282560, "type": "ephemeral_storage"}

The log entry above is an example of when the alert for node with id 1 "Ephemeral storage has reached 70% of its capacity" has been raised as result of storage utilization reaching the value of ~90%.

* Log entry components:
	* daemon.warning: severity of entry is warning
	* Jun 14 14:49:20:  the timestamp of the event
	* node1: machine name
	* event_log: - static text that always appears
	* [3464]: process id
	* "storage_util": 90.061643120001 current ephemeral storage utilization, in this case ~90%
	* "global_threshold": "70" the user configured threshold above which the alert is raised, in this case it is 70%
	* "object": "node:1" the object for which this alert has been raised for, in this case it is node with id 1
	* "state": true current state of the alert, in this case it is on
	* "time": 1434282560 can be ignored
	* "type": "ephemeral_storage" is the code name identifier of this specific event, see :ref:`Events-List`.

"Alert off" log entry sample
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Example::

    daemon.info: Jun 14 14:51:35 node1 event_log[3464]: {"storage_util":60.051723520008, "global_threshold": "70", "object": "node:1", "state":false, "time": 1434283480, "type": "ephemeral_storage"}

The log entry above is an example of when the alert for node with id 1 "Ephemeral storage has reached 70% of its capacity" has been turned off as result of storage utilization reaching the value of ~60%.

* Log entry components:
	* "daemon.info": severity of entry is info
	* "Jun 14 14:51:35": the timestamp of the event
	* "node1": machine name
	* "event_log": static text that always appears
	* "[3464]": process id
	* "storage_util": 60.051723520008 current ephemeral storage utilization, in this case ~60%
	* "global_threshold": "70" the user configured threshold above which the alert is raised, in this case it is 70%
	* "object": "node:1" the object for which this alert has been raised for, in this case it is node with id 1
	* "state": false current state of the alert, in this case it is on
	* "time": 1434283480 can be ignored
	* "type": "ephemeral_storage" is the code name identifier of this specific event, see :ref:`Events-List`.

Odd number of nodes with a minimum of three nodes alert
-------------------------------------------------------

"Alert on" log entry sample
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Example::

    daemon.warning: Jun 14 15:25:00 node1 event_log[8310]: {"object":"cluster", "state": true, "time": 1434284700, "node_count": 1, "type":"even_node_count"}

The log entry above is an example of when the alert for "True high availability requires an odd number of nodes with a minimum of three nodes" has been turned on as result of the cluster having only one node.

* Log entry components:
	* "daemon.warning": severity of entry is warning
	* "Jun 14 15:25:00": the timestamp of the event
	* "node1": machine name
	* "event_log": static text that always appears
	* "[8310]": process id
	* "object": "cluster" the object for which this alert has been raised for, in this case it is the cluster
	* "state": true current state of the alert, in this case it is on
	* "time": 1434284700 can be ignored
	* "node_count": 1 the number of nodes in the cluster, in this case 1
	* "type": "even_node_count" is the code name identifier of this specific event, see see :ref:`Events-List`.

"Alert off" log entry sample
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Example::

    daemon.warning: Jun 14 15:30:40 node1 event_log[8310]: {"object":"cluster", "state": false, "time": 1434285200, "node_count": 3, "type":"even_node_count"}

The log entry above is an example of when the alert for "True high availability requires an odd number of nodes with a minimum of three nodes" has been turned off as result of the cluster having 3 nodes.

* Log entry components:
	* "daemon.info": severity of entry is warning
	* "Jun 14 15:30:40": the timestamp of the event
	* "node1": machine name
	* "event_log": static text that always appears
	* "[8310]": process id
	* "object": "cluster" the object for which this alert has been raised for, in this case it is the cluster
	* "state": false current state of the alert, in this case it is off
	* "time": 1434285200 can be ignored
	* "node_count": 3 - the number of nodes in the cluster, in this case 3
	* "type": "even_node_count" is the code name identifier of this specific event, :ref:`Events-List`.

Node has insufficient disk space for AOF rewrite
------------------------------------------------

"Alert on" log entry sample
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Example::

    daemon.err: Jun 15 13:51:23 node1 event_log[34252]: {"used": 23457188, "missing": 604602126, "object": "node:1", "free": 9867264, "needed":637926578, "state": true, "time": 1434365483, "disk": 705667072, "type":"insufficient_disk_aofrw"}

The log entry above is an example of when the alert for "Node has insufficient disk space for AOF rewrite" has been turned on as result of not having enough persistent storage disk space for AOF rewrite purposes. It is missing 604602126 bytes.

* Log entry components:
	* "daemon.err": severity of entry is err
	* "Jun 15 13:51:23": the timestamp of the event
	* "node1": machine name
	* "event_log": static text that always appears
	* "[34252]": process id
	* "used": 23457188 the amount of disk space in bytes currently used for AOF files
	* "missing": 604602126 - the amount of disk space in bytes that is currently missing for AOF rewrite purposes
	* "object": "node:1" the object for which this alert has been raised for, in this case it is node with id 1
	* "free": 9867264 the amount of disk space in bytes that is currently free
	* "needed": 637926578 the amount of total disk space in bytes that is needed for AOF rewrite purposes
	* "state": true current state of the alert, in this case it is on
	* "time": 1434365483 can be ignored
	* "disk": 705667072 the total size in bytes of the persistent storage
	* "type": "insufficient_disk_aofrw" is the code name identifier of this specific event, :ref:`Events-List`.

"Alert off" log entry sample
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Example::

    daemon.info: Jun 15 13:51:11 node1 event_log[34252]: {"used": 0, "missing":-21614592, "object": "node:1", "free": 21614592, "needed": 0, "state":false, "time": 1434365471, "disk": 705667072, "type":"insufficient_disk_aofrw"}

* Log entry components:
	* "daemon.info": severity of entry is info
	* "Jun 15 13:51:11": the timestamp of the event
	* "node1": machine name
	* "event_log": static text that always appears
	* "[34252]": process id
	* "used": 0 the amount of disk space in bytes currently used for AOF files
	* "missing": -21614592 - the amount of disk space in bytes that is currently missing for AOF rewrite purposes, in this case it is not missing because the number is negative
	* "object": "node:1" the object for which this alert has been raised for, in this case it is node with id 1
	* "free": 21614592 the amount of disk space in bytes that is currently free
	* "needed": 0 the amount of total disk space in bytes that is needed for AOF rewrite purposes, in this case no space is needed
	* "state": false current state of the alert, in this case it is off
	* "time": 1434365471 can be ignored
	* "disk": 705667072 the total size in bytes of the persistent storage
	* "type": "insufficient_disk_aofrw" is the code name identifier of this specific event, :ref:`Events-List`.


.. _Events-List:

List of Logged Events and Alerts
================================

Events and Alerts that may be written to syslog

.. automodule:: cnm.events_doc