Version: 5.0

Self-Monitoring

Self-Monitoring is a module that includes a set of dashboards for centralized cluster health monitoring. It enables timely detection and resolution of cluster and data collection layer anomalies before failures occur.

Conventions

  • SAF_INSTALLER - Directory where the Search Anywhere Framework installation package is extracted
  • USER - System administrator user, typically admin
  • OS_HOME - OpenSearch home directory, typically /app/opensearch/
  • OSD_HOME - OpenSearch Dashboards home directory, typically /app/opensearch-dashboards/
  • LOGSTASH_HOME - Logstash home directory, typically /app/logstash/
  • SBM_HOME - Search Anywhere Framework Beat Manager home directory, typically /app/safBeatManager/
  • SB_HOME - Search Anywhere Framework Beat home directory, typically /app/safBeat/

Self-Monitoring Usage

The module consists of dashboards displaying various critical metrics of the target (monitored) cluster, enabling rapid response to any changes or issues.

Monitored components include:

  • SA Data Storage: Tracks all processes related to cluster operations
  • SA Data Collector: Monitors event ingestion and proper data collection
  • SA Master Node: Monitors control components status and cluster coordination

Architecture

Self-monitoring can be deployed using two distinct architecture types:

  1. Deployment on a dedicated monitoring cluster
  2. Deployment within the target cluster

Terminology

Target cluster - the cluster being monitored for metrics.

Type I Architecture (Dedicated Monitoring Cluster)

Recommendations

We recommend deploying all self-monitoring server components on a single node, particularly for small to medium-sized solutions, as this simplifies deployment and maintenance.

Minimum self-monitoring server specifications:

  • CPU: from 4 cores

  • RAM: from 16 GB

  • Storage: from 100 GB

    Storage capacity depends on the volume of collected data, the number of monitored clusters, and the data retention period; see the worked example below. These parameters can be scaled up as needed.
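As a rough sizing illustration (the daily volume here is an assumption, not a measured value): with the default ISM policy (age_to_delete = 30d) and index templates (number_of_replicas = 0) described later in this guide, a single monitored cluster producing about 3 GB of logs and metrics per day would occupy roughly 3 GB/day x 30 days = 90 GB, which is why 100 GB is treated as the lower bound. Scale the estimate up for additional clusters, replica shards, or longer retention.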

In Type I architecture, the self-monitoring server is a fully isolated cluster comprising:

  • SA Data Storage - Handles data storage and full-text search
  • SA Dashboards - Provides data visualization
  • SA Data Collector - Receives data from target cluster agents and polls the target cluster's REST API
  • SAF Beat Manager - Manages agents installed on target clusters

Advantages

Deploying the self-monitoring server on a dedicated cluster offers key benefits:

  • Availability during target cluster outages

    • The independent self-monitoring server remains operational even if the target cluster fails
  • Reduced target cluster load

    • Monitoring processes are resource-intensive by nature; offloading them to separate infrastructure eliminates resource contention and ensures stable production cluster performance

Disadvantages

While this is the recommended default configuration, consider these factors:

  • Additional infrastructure resources

    • Requires dedicated servers or compute resources, which may increase operational costs
  • Increased configuration complexity

    • To get self-monitoring data from the target cluster, you need to configure the cross-cluster search mechanism (for configuration information, see Useful Links). This adds steps to the configuration process and requires special attention when maintaining the system

Type II Architecture (Primary Cluster)

In the second architecture type, all previously described operations occur within a single cluster where self-monitoring is deployed.

Advantages

  • No need for additional servers or their configuration

    • Eliminates the requirement to allocate and configure extra resources for self-monitoring servers, potentially reducing infrastructure costs and simplifying deployment
  • Immediate access to self-monitoring data on the primary cluster

    • Self-monitoring data becomes instantly available within the target cluster, facilitating easier access and analysis for system operators and administrators

Disadvantages

  • Self-monitoring unavailability during cluster failures

    • If the primary cluster experiences a critical failure, the self-monitoring server also becomes unavailable, potentially complicating system status and performance monitoring
  • Additional load on the primary cluster

    • Running self-monitoring on the primary cluster may impose extra resource demands, which could impact application or service performance

Data Collection

The data collected by self-monitoring can be categorized into two types:

  1. Log files
  2. Metrics and other statistics

Log Collection

Log collection is performed from all SA Data Storage, SA Data Collector, and SA Master Node nodes in the target cluster. SAF Beat agents are installed and configured on the target cluster hosts (installation instructions can be found in the Useful Links section). During configuration, the SAF Beat Manager address of the self-monitoring server is specified; the manager holds the configurations for the data collection and transmission agents, which are then installed and run on the cluster hosts. For log collection, Filebeat is configured to collect the following logs (a minimal configuration sketch is shown after the lists below):

For SA Data Storage hosts:

  • Cluster logs
  • sme logs
  • sme-re logs
  • job scheduler logs

For SA Data Collector hosts:

  • SA Data Collector logs
  • SA Data Collector pipeline logs
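For illustration, a minimal Filebeat configuration for shipping one of these logs to the self-monitoring SA Data Collector might look like the sketch below. This is not the configuration generated by the module (that is produced by generate_pipelines.py and distributed via SAF Beat Manager); the path, address, and port simply mirror the config.ini defaults described later (smos_cluster_log_path, logstash_ips, filebeat_selfmon_port).

filebeat.inputs:
  - type: filestream
    id: sm-cluster-log
    paths:
      - /app/logs/opensearch/sm-cluster.log    # smos_cluster_log_path

output.logstash:
  hosts: ["172.17.0.10:5050"]                  # logstash_ips : filebeat_selfmon_port
  ssl.certificate_authorities: ["/app/safBeat/ca-cert.pem"]
  ssl.certificate: "/app/safBeat/cert.pem"
  ssl.key: "/app/safBeat/key.pem"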

Metrics Collection

For collecting metrics from SA Data Storage, the http_poller input plugin in SA Data Collector is used. This plugin periodically polls specified REST endpoints.

note

In the self-monitoring pipeline templates, you can see that the http_poller plugin can send identical requests to all master nodes of the target cluster and then use the throttle filter plugin to filter duplicate responses. This ensures data retrieval even if one or more master nodes fail, as requests will be executed to the remaining operational nodes.
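As an illustration of this pattern, a simplified pipeline (a sketch only, not the shipped template) could combine an http_poller input that queries the _cluster/health endpoint on every master node with a throttle filter that drops the duplicate responses; the addresses, user, and polling period below correspond to the poller fields of config.ini described in the Configuration section.

input {
  http_poller {
    urls => {
      # identical request to each master node of the target cluster
      master1 => "https://172.16.0.1:9200/_cluster/health"
      master2 => "https://172.16.0.2:9200/_cluster/health"
      master3 => "https://172.16.0.3:9200/_cluster/health"
    }
    schedule => { every => "60s" }   # poller.request_period
    user => "admin"                  # poller.user
    password => "${os_pwd}"          # poller.pwd_token, resolved from the Logstash keystore
    # TLS/truststore settings from the logstash section of config.ini are omitted in this sketch
  }
}

filter {
  # keep the first response per polling interval and tag the duplicates
  throttle {
    key => "clusterhealth"
    period => 60
    max_age => 120
    after_count => 1
    add_tag => ["throttled"]
  }
  if "throttled" in [tags] {
    drop { }
  }
}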

For collecting metrics from SA Data Collector, Metricbeat is used.
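A minimal Metricbeat sketch for this (again, not the generated configuration) could use the Metricbeat logstash module against the local Logstash monitoring API and forward the results to the self-monitoring SA Data Collector; the output address and port mirror the beats defaults in config.ini:

metricbeat.modules:
  - module: logstash
    metricsets: ["node", "node_stats"]
    period: 30s
    hosts: ["http://localhost:9600"]   # local Logstash monitoring API

output.logstash:
  hosts: ["172.17.0.10:5052"]          # logstash_ips : metricbeat_logstash_port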

Configuration

The self-monitoring package includes scripts for pipeline generation, agent configurations, and automation of other necessary operations:

  1. generate_pipelines.py: This script generates pipelines and agent configurations. It automatically creates the required elements for data collection and agent setup

  2. generate_opensearch_configs.py: This script generates Index State Management (ISM) policies, creates index templates, and copies dashboards. It can also connect to SA Data Storage and create corresponding policies, index templates, and indices when needed

  3. import_certs.sh: This script adds host certificates to the truststore, which is essential for establishing secure TLS connections when SA Data Collector accesses target hosts via API

Configuration file

All the scripts mentioned above extract the settings from the configuration file config.ini.

The fields of the configuration file, grouped by section, are described below; an illustrative excerpt follows the descriptions.

rest

Data for loading ISM policies, index templates, and index creation in SA Data Storage (used by the generate_opensearch_configs.py script):

  • user = admin — SA Data Storage user (the password will be requested interactively by the script)
  • master_node = 127.0.0.1 — any master node of the self-monitoring server for accessing the REST API
  • dashboards_node = 127.0.0.1 — the address of the SA Web node
ism

Data for configuring the ISM policy for indexes with self-monitoring data:

  • patterns = clusterhealth-*, clusterlogs-*, sme_re_logs-*, job_scheduler_logs-*, job_scheduler_audit_logs-*, core_logs-*, clusterstats-*, logstashstats-*, logstashlogs-*, shardstats-*, indexstats-*, cluster_query_types-* — index patterns
  • policy_id = selfmonitoring — ISM policy ID
  • age_to_rollover = 7d — index age that triggers rollover
  • primary_size_to_rollover = 10gb — primary shard size that triggers rollover
  • rollover_retry_count = 5 — number of rollover retry attempts
  • rollover_retry_backoff = constant — backoff function for rollover retries
  • rollover_retry_delay = 5m — delay between rollover retries
  • age_to_delete = 30d — index age at which the index is deleted
  • delete_retry_count = 3 — number of delete retry attempts
  • delete_retry_backoff = linear — backoff function for delete retries
  • delete_retry_delay = 1h — delay between delete retries
  • delete_timeout = 1h — delete attempt timeout
index_templates

Data for creating index templates with self-monitoring data:

  • indexes = clusterhealth, clusterlogs, sme_re_logs, job_scheduler_logs, job_scheduler_audit_logs, core_logs, clusterstats, logstashstats, logstashlogs, shardstats, indexstats, cluster_query_types — list of aliases
  • routing_mode = warm — node type used for index allocation
  • number_of_shards = 1 — number of primary shards
  • number_of_replicas = 0 — number of replica shards
poller

Data for the http_poller input plugin:

  • request_period = 60 — polling interval for target cluster nodes (in seconds)
  • master_nodes = 172.16.0.1, 172.16.0.2, 172.16.0.3 — target cluster nodes with the master role
  • user = admin — SA Data Storage user under which REST API requests to the target cluster are executed
  • pwd_token = os_pwd — keystore token holding the password of that target cluster SA Data Storage user
logstash

Configuration parameters of the SA Data Collector on the self-monitoring server:

  • user = admin — SA Data Storage user under which the self-monitoring server's SA Data Collector sends data to SA Data Storage
  • pwd_token = os_pwd — keystore token holding the password of the self-monitoring SA Data Storage user
  • data_nodes = 172.16.0.4, 172.16.0.5, 172.16.0.6 — list of self-monitoring server nodes with the data role to which data is sent
  • scripts_path = /app/logstash/config/conf.d/scripts/ — directory containing auxiliary scripts for the SA Data Collector pipelines
  • truststore_path = /app/logstash/config/truststore.jks — location of the truststore with the imported certificates of the target cluster hosts
  • truststore_pwd_token = ts_pwd — keystore token holding the truststore password
  • ca_cert_path = /app/logstash/config/ca-cert.pem — location of the CA certificate of the self-monitoring server
  • node_cert_path = /app/logstash/config/node-cert.pem — location of the node certificate for the self-monitoring SA Data Collector
  • node_key_path = /app/logstash/config/node-key.pem — location of the private key for that node certificate
beats

Agent Configuration Settings:

  • logstash_ips = 172.17.0.10 — address of the self-monitoring cluster's SA Data Collector instances to which data from the target cluster is sent
  • filebeat_selfmon_port = 5050 — port for the Filebeat instance that sends SA Data Storage logs
  • filebeat_logstash_port = 5051 — port for the Filebeat instance that sends SA Data Collector logs
  • metricbeat_logstash_port = 5052 — port for the Metricbeat instance that sends SA Data Collector metrics
  • cert_path = /app/safBeat/cert.pem — location of the SAF Beat certificate
  • key_path = /app/safBeat/key.pem — location of the SAF Beat certificate key
  • ca_cert_path = /app/safBeat/ca-cert.pem — location of the CA certificate of the self-monitoring server
  • smos_cluster_log_path = /app/logs/opensearch/sm-cluster.log — location of cluster logs
  • sme_log_path = /app/logs/opensearch/sme.log — location of SA Engine (sme) logs
  • sme_re_log_path = /app/logs/sme-re/main.log — location of SA Engine RE (sme-re) logs
  • job_scheduler_log_path = /app/logs/job_scheduler.log — location of job scheduler logs
  • job_scheduler_audit_log_path = /app/logs/job_scheduler_audit.log — location of job scheduler audit logs
  • core_log_path = /app/logs/core.log — location of Core logs
  • logstash_logs_path = /app/logs/logstash/*.log — location of SA Data Collector logs
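For reference, an abridged config.ini assembled from the defaults above might look like the following excerpt. It assumes each field group is an INI section and omits many fields for brevity; the file shipped with the package is authoritative.

[rest]
user = admin
master_node = 127.0.0.1
dashboards_node = 127.0.0.1

[ism]
policy_id = selfmonitoring
age_to_rollover = 7d
primary_size_to_rollover = 10gb
age_to_delete = 30d

[poller]
request_period = 60
master_nodes = 172.16.0.1, 172.16.0.2, 172.16.0.3
user = admin
pwd_token = os_pwd

[logstash]
user = admin
pwd_token = os_pwd
data_nodes = 172.16.0.4, 172.16.0.5, 172.16.0.6
truststore_path = /app/logstash/config/truststore.jks
truststore_pwd_token = ts_pwd

[beats]
logstash_ips = 172.17.0.10
filebeat_selfmon_port = 5050
filebeat_logstash_port = 5051
metricbeat_logstash_port = 5052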

Installation

note

The SA Master Node, SA Data Storage, SA Data Collector, and SAF Beat Manager components (installation information can be found in the Useful Links section) must be properly installed and configured before proceeding with this guide.

The selfmonitoring component is included in the base SAF package and is located in the ${SAF_INSTALLER}/utils/selfmonitoring/ directory. We recommend using the Python interpreter that is also included in the installer package and located at ${SAF_INSTALLER}/utils/python/bin/python3.

Configuration Setup

Populate the config.ini configuration file with actual parameters.

note

Most parameters related to paths, polling settings, ISM, and index templates can remain unchanged. Required modifications typically involve IP addresses only. However, we strongly recommend carefully reviewing each parameter before use to ensure all values are correct and meet your system's current requirements.

Running Scripts

Generating Files for SA Data Storage

When executed without arguments, the generate_opensearch_configs.py script generates ISM policies, index templates, and dashboards, saving them to the ${SAF_INSTALLER}/utils/selfmonitoring/generated_data directory:

${SAF_INSTALLER}/utils/python/bin/python3 ${SAF_INSTALLER}/utils/selfmonitoring/generate_opensearch_configs.py

When executed with the --upload argument, the script will both generate files and upload the content to the self-monitoring SA Data Storage (also creating the actual indices):

${SAF_INSTALLER}/utils/python/bin/python3 ${SAF_INSTALLER}/utils/selfmonitoring/generate_opensearch_configs.py --upload
note

You may first run the script without the --upload flag to review the output, then rerun with --upload to upload to SA Data Storage after verifying the results.

Generating Files for SA Data Collector and SAF Beat Manager

Execute the ${SAF_INSTALLER}/utils/selfmonitoring/generate_pipelines.py script:

${SAF_INSTALLER}/utils/python/bin/python3 ${SAF_INSTALLER}/utils/selfmonitoring/generate_pipelines.py

After execution, the ${SAF_INSTALLER}/utils/selfmonitoring/generated_data directory will contain new subdirectories with data for SA Data Collector and SAF Beat Manager.

Certificate Import

Execute the ${SAF_INSTALLER}/utils/selfmonitoring/import_certs.sh script:

cd ${SAF_INSTALLER}/utils/selfmonitoring/ && chmod +x import_certs.sh
sudo -u logstash ./import_certs.sh
Important

The script must be executed under the logstash user account.

Script execution flow:

  1. Creating a new truststore: When launched, the script will prompt twice for a new password for the certificate store. Remember this password as it will be required in subsequent steps
  2. Certificate retrieval: The script will connect to each host and retrieve certificates. When prompted Trust this certificate? [no]:, answer yes
  3. Password storage: The script will request passwords for ts_pwd and os_pwd tokens. For ts_pwd: Enter the password created in step 1. For os_pwd: Enter the SA Data Storage user password specified in config.ini

If authentication credentials differ between target and monitoring clusters, and a different token is specified in the logstash.pwd_token configuration field, add it manually on the monitoring cluster's SA Data Collector server:

sudo -u logstash ${LOGSTASH_HOME}/bin/logstash-keystore add <TOKEN_NAME>
note
  • If a truststore already exists at the path specified in the configuration on the SA Data Collector when the script is launched, the script will ask for its current password once

  • If no keystore exists when the script runs, you will see: The keystore password is not set... Continue without password protection on the keystore? [y/N]. Answer y. If a keystore already exists and is password-protected, the script will prompt for its password; if it exists without a password, no additional input is required

Deploying Generated Files

After completing the previous steps, all required files (build artifacts) will be available in ${SAF_INSTALLER}/utils/selfmonitoring/generated_data. Directory references below are relative to this location unless absolute paths are specified.
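Based on the directories referenced in the steps below, the generated_data layout looks roughly like this (subdirectory names are taken from this guide; the exact contents and the location of pipelines.yml may differ between versions):

generated_data/
  ism/                # ISM policies for SA Data Storage
  index_templates/    # index templates for SA Data Storage
  dashboards/         # dashboard JSON files for import
  logstash/
    pipelines/        # pipeline .conf files for SA Data Collector
    scripts/          # auxiliary scripts for the pipelines
    pipelines.yml     # entries to append to ${LOGSTASH_HOME}/config/pipelines.yml
  sbm/                # app packages for SAF Beat Manager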

SA Data Storage Configuration

Create ISM policies and index templates using the files in the ism and index_templates directories respectively, then create the indices (an example request sequence is shown after the note below).

note

Skip this step if you automatically uploaded SA Data Storage configurations (by running generate_opensearch_configs.py with the --upload flag)
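If you are applying these configurations manually, they can be loaded through the SA Data Storage REST API. The sketch below uses curl against the rest.master_node address; the file names are illustrative placeholders, so substitute the actual files from the ism and index_templates directories:

# create the ISM policy (policy_id from config.ini)
curl -k -u admin -X PUT "https://127.0.0.1:9200/_plugins/_ism/policies/selfmonitoring" \
  -H 'Content-Type: application/json' -d @ism/<policy_file>.json

# create an index template
curl -k -u admin -X PUT "https://127.0.0.1:9200/_index_template/clusterhealth" \
  -H 'Content-Type: application/json' -d @index_templates/<template_file>.json

# create the initial write index behind the clusterhealth alias
curl -k -u admin -X PUT "https://127.0.0.1:9200/clusterhealth-000001" \
  -H 'Content-Type: application/json' -d '{"aliases": {"clusterhealth": {"is_write_index": true}}}'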

SA Data Collector Configuration

  1. Copy pipelines from ${SAF_INSTALLER}/utils/selfmonitoring/generated_data/logstash/pipelines/ to ${LOGSTASH_HOME}/config/conf.d/
  2. Transfer scripts from ${SAF_INSTALLER}/utils/selfmonitoring/generated_data/logstash/scripts/ to the directory specified in the logstash.scripts_path configuration parameter (default: ${LOGSTASH_HOME}/config/conf.d/scripts/)
  3. Append the contents of pipelines.yml to ${LOGSTASH_HOME}/config/pipelines.yml (example entries are shown below)
  4. Modify ownership of transferred files
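For reference, appended pipelines.yml entries follow the standard Logstash format; the pipeline IDs and file names below are illustrative, and the actual values come from the generated pipelines.yml:

- pipeline.id: selfmon_clusterhealth
  path.config: "/app/logstash/config/conf.d/selfmon_clusterhealth.conf"
- pipeline.id: selfmon_filebeat
  path.config: "/app/logstash/config/conf.d/selfmon_filebeat.conf"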

After completing the steps, restart the SA Data Collector service:

sudo chown -R logstash:logstash ${LOGSTASH_HOME}/
sudo systemctl restart logstash.service
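After the restart, you can optionally confirm that the new pipelines are loaded using the standard Logstash monitoring API (port 9600 is the Logstash API default; adjust it if it has been changed in your installation):

curl -s "http://localhost:9600/_node/pipelines?pretty"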

SAF Beat Manager Configuration

  1. Copy contents from ${SAF_INSTALLER}/utils/selfmonitoring/generated_data/sbm to ${SBM_HOME}/apps/

  2. Download the required binaries, filebeat and metricbeat (contact technical support if download issues occur), and place them in ${SBM_HOME}/binaries/

  3. Edit ${SBM_HOME}/etc/serverclasses.yml to include the following groups: linux_selfmon and linux_selfmon_logstash:

groups:
  - name: linux_selfmon
    filters:
      - "<IP 1st node opensearch>"
      - "<IP 2nd node opensearch>"
      ...
      - "<IP N-th node opensearch>"
    apps:
      - filebeat_selfmon
    binaries:
      - filebeat-oss-8.9.2-linux-x86_64.tar.gz

  - name: linux_selfmon_logstash
    filters:
      - "<IP 1-st logstash>"
      - "<IP 2-nd logstash>"
      ...
      - "<IP N-th logstash>"
    apps:
      - filebeat_logstash
      - metricbeat_logstash
    binaries:
      - filebeat-oss-8.9.2-linux-x86_64.tar.gz
      - metricbeat-oss-8.9.2-linux-x86_64.tar.gz
Important

Pay special attention to the filters and binaries sections. In filters, specify the IP addresses of all target cluster nodes for the first group and the IP addresses of all Logstash instances in the target cluster for the second group. In binaries, verify that the archive names match those you downloaded to ${SBM_HOME}/binaries/.

After completing these steps, restart the SAF Beat Manager service:

sudo systemctl restart safBeatManager.service

Dashboard Deployment

note

SAF Beat must be installed on all SA Data Collector, SA Data Storage, and SA Master Node instances in the target cluster (installation instructions available in Useful Links). Configure SAF Beat to use the monitoring cluster's SAF Beat Manager server.

  1. In the monitoring web interface, navigate to Dashboards (Main menu - General - Dashboards) and create new dashboards using the JSON files from the dashboards directory
  2. Update node selection filters in dashboards with current cluster data
  3. In the Module Settings section, go to Index Templates (Main Menu - System Settings - Module Settings - Index Settings Templates) and create templates corresponding to the index aliases (clusterstats, clusterhealth, etc.)

The self-monitoring installation is now complete.

info

For dedicated monitoring cluster architectures, configure cross-cluster search (configuration details available in Useful Links) to enable queries from the target cluster.