Version: 6.1

Disk Watermark Configuration

Overview

The Disk Watermark mechanism is a three-level protection system designed to:

preventively limit data writing to nodes
initiate shard redistribution
completely block writes in critical situations

The mechanism's goal is to preserve the integrity of existing indices when disk space is exhausted.

Mechanism Architecture

The cluster.routing.allocation.disk.watermark.* settings are defined at the cluster level. This is a single, global rule stored in cluster settings (cluster state) and applied uniformly to all data nodes. You cannot set different fill percentages for different nodes using these parameters.

Threshold checks and reactions are evaluated per node. The unified cluster setting is continuously compared with the state of each node's local data disk. Whether a threshold is exceeded, and which actions follow (refusing to place new shards, moving shards, transitioning to read-only), is determined for each node separately from that node's own disk metrics.

All requests are executed through the Developer Console (Main Menu - System Parameters - Developer Console) or with curl.

Example request:
curl -k -X PUT "https://<opensearch_host>:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -u 'admin:admin' \
  -d '{
    "persistent": {
      "cluster.routing.allocation.disk.watermark.low": "80%",
      "cluster.routing.allocation.disk.watermark.high": "85%",
      "cluster.routing.allocation.disk.watermark.flood_stage": "90%"
    }
  }'

Three-Level Cluster Protection System Against Disk Overflow

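Level 1: cluster.routing.allocation.disk.watermark.low
Level 2: cluster.routing.allocation.disk.watermark.high
Level 3: cluster.routing.allocation.disk.watermark.flood_stage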

Parameter: cluster.routing.allocation.disk.watermark.low

Default value: 85%

Behavior:

node is excluded from candidates for new shard placement
master node stops assigning replicas and relocated shards to this node (primary shards of newly created indices are not affected by this threshold)
existing writes continue

Purpose: Warning administrators about approaching critical disk fill levels.

Parameter: cluster.routing.allocation.disk.watermark.high

Default value: 90%

Behavior:

master node initiates shard movement from overloaded nodes
when Shard Allocation Awareness is enabled, failure zones are considered

Critical nuance: A shard can be moved only if another node satisfies the allocation rules and has enough free space to accept it. If no such node exists, the shard stays on the overloaded node.

Risks: The relocation process consumes significant resources (IOPS, network, CPU).

note

Shard Allocation Awareness is a mechanism that ensures primary shards and their replicas are not located on the same physical device or failure zone.

How relocation works with Shard Allocation Awareness: if an administrator sets, for example, cluster.routing.allocation.awareness.attributes: zone, and the nodes are configured with node.attr.zone: zoneA and node.attr.zone: zoneB, the cluster keeps copies of the same shard in different zones (zoneA and zoneB). When the high threshold is triggered on a node in zoneA, the cluster cannot move the shard to just any node: with one replica and two zones, moving it to zoneB would put both copies in the same zone. It has to find another node in zoneA that meets the allocation conditions. This may slow down evacuation or make it temporarily impossible if there is insufficient space in zoneA.

Parameter: cluster.routing.allocation.disk.watermark.flood_stage

Default value: 95%

Behavior:

all indices with shards on the affected node are switched to read_only_allow_delete mode
operations requiring disk space are prohibited
only data deletion is allowed

Purpose: Preventing damage to existing shards when disk space is exhausted.

note

The index.blocks.read_only_allow_delete block is set directly in the index metadata stored in the cluster state. As a result, the write restriction applies to all nodes containing shards of this index, regardless of the fill level of their local disks. Thus, even with replicas on healthy nodes that haven't reached threshold values, writing to the index will be completely stopped until the administrator forcibly removes the block.
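The three levels can be summarized as a small classification function. This is only an illustrative sketch in Python, not the cluster's actual implementation: the name `watermark_level` and its return labels are assumptions made for this example, and the thresholds are the default percentages listed above.

```python
# Illustrative sketch only: maps a node's disk usage to the watermark level
# it has exceeded. Function name and labels are hypothetical, not a product API.

DEFAULT_THRESHOLDS = {"low": 85.0, "high": 90.0, "flood_stage": 95.0}


def watermark_level(used_percent, thresholds=DEFAULT_THRESHOLDS):
    """Return the most severe watermark exceeded by a node, or "ok"."""
    if not 0.0 <= used_percent <= 100.0:
        raise ValueError("used_percent must be between 0 and 100")
    # Check from the most severe level down so the strictest match wins.
    for name in ("flood_stage", "high", "low"):
        if used_percent > thresholds[name]:
            return name
    return "ok"


if __name__ == "__main__":
    for usage in (70.0, 86.5, 92.0, 97.3):
        print(f"{usage:5.1f}% used -> {watermark_level(usage)}")
```

Each level includes the restrictions of the previous ones, which is why the function reports only the most severe level exceeded.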

Check current cluster settings:

GET _cluster/settings?include_defaults=true

Apply required disk watermark settings:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}

Risks of Using Percentage-Based Threshold Values

The default percentage values (85%, 90%, 95%) carry a risk: a relative value says nothing about the absolute amount of free space left on the device when a protective mechanism triggers. As a result, a shard can fail for lack of disk space before the calculated threshold is reached.

Example:

A Search Anywhere Framework server with a partition capacity of 2 TB.

with flood_stage threshold set to 95%, write blocking is activated only after the unallocated space volume decreases to 100 GB
meanwhile, a shard of 50 GB may be located on the node, for which a forced segment merge operation is planned. During this operation, the Apache Lucene library temporarily reserves disk space comparable to the size of the merged shard
due to this discrepancy, there is a risk of emergency write termination with a No space left on device error and, consequently, shard damage before the disk fill sensor detects 95% and initiates protective blocking
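The arithmetic behind this example can be checked with a short Python sketch. It rests on two assumptions: 1 TB is counted as 1000 GB, and a forced merge needs temporary free space roughly equal to the shard size, as described above. The function names are hypothetical.

```python
# Sketch under stated assumptions: 1 TB = 1000 GB, and a forced merge needs
# temporary free space roughly equal to the size of the merged shard.


def free_space_at_threshold(capacity_gb, threshold_percent):
    """Free space in GB left when disk usage reaches threshold_percent."""
    return capacity_gb * (100.0 - threshold_percent) / 100.0


def merges_covered(reserve_gb, shard_gb):
    """Number of simultaneous merges of shard_gb-sized shards the reserve absorbs."""
    return int(reserve_gb // shard_gb)


if __name__ == "__main__":
    for capacity_gb in (2000, 500):
        reserve = free_space_at_threshold(capacity_gb, 95)
        print(
            f"{capacity_gb} GB disk: {reserve:.0f} GB free at 95%, "
            f"room for {merges_covered(reserve, 50)} merge(s) of a 50 GB shard"
        )
```

On the 2 TB partition the reserve covers at most two such merges at once, and concurrent indexing reduces it further. On a 500 GB disk the same 95% threshold leaves less space than a single merge of this size needs.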

Configuration Recommendation

For large-capacity partitions, it is advisable to use absolute threshold values (mb, gb). An absolute value specifies the minimum free space that must remain on the disk, so the low threshold is the largest value and flood_stage the smallest. This approach guarantees a pre-calculated reserve of free space sufficient for resource-intensive background operations on large shards.

Example configuration with absolute values:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "200gb",
    "cluster.routing.allocation.disk.watermark.high": "100gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "50gb"
  }
}

When setting absolute values, observe the following condition: the low threshold must be smaller than the physical capacity of the smallest disk in the cluster. Otherwise, the node with the smallest disk is excluded from allocation and stops accepting shards entirely.
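Before applying absolute values, the condition above can be checked offline. The following Python sketch is a hypothetical sanity check, not the validation the cluster itself performs. The helper names `parse_size` and `check_absolute_watermarks` are assumptions, and only the kb, mb, gb and tb suffixes are handled.

```python
import re

# Hypothetical pre-flight check for absolute watermark values.
# Sizes are interpreted with binary multiples (1 gb = 1024**3 bytes).
_UNITS = {"kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}


def parse_size(value):
    """Convert a string such as "200gb" into a number of bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(kb|mb|gb|tb)", value.strip().lower())
    if match is None:
        raise ValueError(f"unsupported size value: {value!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])


def check_absolute_watermarks(low, high, flood_stage, smallest_disk):
    """Return a list of problems; an empty list means the values look consistent."""
    low_b, high_b, flood_b, disk_b = map(
        parse_size, (low, high, flood_stage, smallest_disk)
    )
    problems = []
    # Absolute values are minimum free space, so low must be the largest.
    if low_b < high_b or high_b < flood_b:
        problems.append("expected low >= high >= flood_stage for absolute values")
    # A low threshold at or above the disk size can never be satisfied.
    if low_b >= disk_b:
        problems.append("low watermark is not smaller than the smallest disk")
    return problems


if __name__ == "__main__":
    print(check_absolute_watermarks("200gb", "100gb", "50gb", "500gb"))
    print(check_absolute_watermarks("200gb", "100gb", "50gb", "150gb"))
```

The second call reports a problem because a 150 GB disk can never have 200 GB free.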

note

In a cluster with disks of different sizes (e.g., 500 GB and 3 TB), percentage thresholds leave very different reserves: 10% of 500 GB is 50 GB, while 10% of 3 TB is 300 GB, so balancing behaves unpredictably. Absolute values are recommended in this case. In general, data nodes in a Search Anywhere Framework cluster should have identical disk storage sizes.

Difference between `transient` and `persistent` in Search Anywhere Framework Cluster Settings

When configuring a cluster, the following parameters can be used:

transient (temporary settings):

applied immediately but not saved after cluster restart
have priority over persistent: if the same parameter is set in both sections, the value from transient is used
suitable for operational changes, testing, temporary problem resolution
reset to default values or to persistent values after a full cluster restart

persistent (permanent settings):

saved in the cluster state and preserved after cluster restarts
act permanently until explicitly changed or reset (set to null)
applied immediately but can be overridden by transient settings
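The precedence rule can be illustrated with a minimal Python sketch. It models only the three sources named in this section (transient, persistent, defaults). The function names and the dictionaries are hypothetical and stand in for the sections of a `GET _cluster/settings?include_defaults=true` response.

```python
# Minimal model of setting precedence: transient > persistent > default.
# The dictionaries are illustrative stand-ins, not a real API response.


def effective_setting(key, transient, persistent, defaults):
    """Return (value, source) for the setting that is currently in effect."""
    sources = (
        ("transient", transient),
        ("persistent", persistent),
        ("default", defaults),
    )
    for source_name, source in sources:
        value = source.get(key)
        if value is not None:  # null (None) means "not set in this section"
            return value, source_name
    raise KeyError(key)


def after_full_cluster_restart(transient, persistent):
    """Transient values are dropped on a full restart; persistent ones survive."""
    return {}, dict(persistent)


if __name__ == "__main__":
    key = "cluster.routing.allocation.disk.watermark.flood_stage"
    transient = {key: "98%"}
    persistent = {key: "90%"}
    defaults = {key: "95%"}
    print(effective_setting(key, transient, persistent, defaults))
    transient, persistent = after_full_cluster_restart(transient, persistent)
    print(effective_setting(key, transient, persistent, defaults))
```

The first call returns the transient value, the second the persistent one, which matches the behavior described in the two lists above.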

Incident Response Actions

When flood_stage is triggered, the cluster switches indices to read-only mode. Keep in mind that the block is not removed automatically when the disk is expanded or data is deleted.

1. Diagnostics

Check cluster status:

GET _cluster/health?pretty

Possible values for the status parameter:

green — all shards are distributed
yellow — some replicas are not distributed
red — some primary shards are not distributed, data is unavailable

Check disk and shard status on nodes:

GET _cat/allocation?v&h=node,disk.percent,disk.used,disk.avail,disk.total,shards

Example command output:

node          disk.percent  disk.used  disk.avail  disk.total  shards
smos-node-02            25      251gb       725gb       976gb    1953
smos-node-00            29    276.4gb     671.4gb     947.9gb    1952

Find indices occupying the most space:

GET _cat/indices?v&h=index,store.size&s=store.size:desc

If the index data is not critical, the indices can be deleted:

DELETE /<index-name>
DELETE /sm_servers_hosts_uptime-000405
DELETE /sm_servers_hosts_uptime-2025*

2. Freeing Physical Space

The primary task is to eliminate the root cause: either physically expand disk space (via LVM, cloud disk) or clean up old indices/documents to reduce disk fill below the flood_stage threshold.

Temporary measure if there's no immediate possibility to expand the disk:

Raise the flood_stage threshold through transient settings (will reset after cluster restart):

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.flood_stage": "98%"
  }
}

With a high data flow and large shards, however, this measure will not help: the raised threshold is reached quickly as well.

3. Forced Block Removal

Execute a request to remove the read_only_allow_delete flag from all cluster indices:

PUT */_settings
{
  "index.blocks.read_only_allow_delete": null
}

To remove the block from a specific index:

PUT /<index-name>/_settings
{
  "index.blocks.read_only_allow_delete": null
}

Check for blocked indices:

GET _all/_settings?filter_path=**.blocks*

In the command output, a blocked index has the parameter:

"blocks": {
  "read_only_allow_delete": "true"
}
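When many indices are involved, the response can be filtered programmatically. The Python sketch below assumes the response has already been fetched and parsed into a dictionary of the usual `index name -> settings -> index -> blocks` shape. The function name `blocked_indices` and the second index name in the sample are hypothetical.

```python
# Sketch: list indices that carry the read_only_allow_delete block, given the
# parsed JSON body of GET _all/_settings?filter_path=**.blocks*


def blocked_indices(settings_response):
    """Return a sorted list of index names with read_only_allow_delete set."""
    blocked = []
    for index_name, body in settings_response.items():
        blocks = body.get("settings", {}).get("index", {}).get("blocks", {})
        # The flag is returned as the string "true", so compare as text.
        if str(blocks.get("read_only_allow_delete", "false")).lower() == "true":
            blocked.append(index_name)
    return sorted(blocked)


if __name__ == "__main__":
    sample = {
        "sm_servers_hosts_uptime-000405": {
            "settings": {"index": {"blocks": {"read_only_allow_delete": "true"}}}
        },
        "example-index": {  # hypothetical index with a different block
            "settings": {"index": {"blocks": {"write": "true"}}}
        },
    }
    print(blocked_indices(sample))
```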

4. Restoring Stuck Shards

After removing the block, check the cluster status again:

GET _cluster/health?pretty

Cluster recovery may take a significant amount of time.

If the status remains red or yellow and the number of unassigned shards stops decreasing, allocation attempts have failed.

Check unassigned shard status:

GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state
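On a cluster with thousands of shards, it helps to group the unassigned ones by reason. The Python sketch below assumes the same request was issued with `format=json`, so each row is a dictionary keyed by the column names from the `h` parameter. The function name and the sample rows are hypothetical.

```python
from collections import defaultdict

# Sketch: group unassigned shards by reason. Input is the parsed JSON list
# returned by the _cat/shards request above when format=json is added.


def unassigned_by_reason(rows):
    """Map each unassigned.reason to a list of (index, shard, prirep) tuples."""
    grouped = defaultdict(list)
    for row in rows:
        if row.get("state") != "UNASSIGNED":
            continue
        reason = row.get("unassigned.reason") or "UNKNOWN"
        grouped[reason].append((row["index"], row["shard"], row["prirep"]))
    return dict(grouped)


if __name__ == "__main__":
    sample_rows = [  # hypothetical rows for illustration
        {"index": "idx-a", "shard": "0", "prirep": "p", "state": "STARTED",
         "unassigned.reason": None},
        {"index": "idx-a", "shard": "0", "prirep": "r", "state": "UNASSIGNED",
         "unassigned.reason": "ALLOCATION_FAILED"},
        {"index": "idx-b", "shard": "1", "prirep": "r", "state": "UNASSIGNED",
         "unassigned.reason": "NODE_LEFT"},
    ]
    for reason, shards in unassigned_by_reason(sample_rows).items():
        print(reason, shards)
```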

Check allocation status:

GET _cluster/allocation/explain?pretty

To diagnose a shard of a specific index <index-name>:

GET _cluster/allocation/explain
{
  "index": "<index-name>",
  "shard": 0,
  "primary": true
}

Initiate forced redistribution. The request makes a single attempt and can be repeated if necessary:

POST _cluster/reroute?retry_failed=true

5. Accelerating Recovery (Optional)

By default, disk usage information is updated every 30 seconds. To make the master node notice the freed space sooner and lift the in-memory restriction from the node, you can temporarily reduce the interval:

PUT _cluster/settings
{
  "transient": {
    "cluster.info.update.interval": "15s"
  }
}

Additionally, you can increase limits on simultaneous shard recoveries:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": 4,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": 4,
    "indices.recovery.max_bytes_per_sec": "150mb"
  }
}

After completing recovery work, be sure to return to default values to avoid creating excessive load on the cluster:

PUT _cluster/settings
{
  "transient": {
    "cluster.info.update.interval": null,
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": null,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": null,
    "indices.recovery.max_bytes_per_sec": null
  }
}
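A reset only takes effect in the section (transient or persistent) where the value was originally set. One way to avoid a mismatch is to derive the reset body from the body that was applied. The Python sketch below is a hypothetical helper for that purpose; `reset_body` is not part of any product API.

```python
import json

# Sketch: build a reset request body from the body that was applied earlier,
# keeping every key in its original section and setting it to null.


def reset_body(applied):
    """Return a copy of `applied` with every setting value replaced by None."""
    return {
        section: {key: None for key in settings}
        for section, settings in applied.items()
    }


if __name__ == "__main__":
    applied = {
        "transient": {
            "cluster.routing.allocation.node_concurrent_incoming_recoveries": 4,
            "cluster.routing.allocation.node_concurrent_outgoing_recoveries": 4,
            "indices.recovery.max_bytes_per_sec": "150mb",
        }
    }
    # json.dumps renders None as null, which is what the settings API expects.
    print(json.dumps(reset_body(applied), indent=2))
```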

Resetting disk watermark parameters to default values

If a configuration change was unsuccessful, you can always roll back to the default values.

Persistent settings are preserved even after a cluster restart. However, transient settings have higher priority: if the thresholds were previously changed through transient, resetting persistent to null has no visible effect while the transient values are active. The request below therefore resets both sections.

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": null,
    "cluster.routing.allocation.disk.watermark.high": null,
    "cluster.routing.allocation.disk.watermark.flood_stage": null
  },
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": null,
    "cluster.routing.allocation.disk.watermark.high": null,
    "cluster.routing.allocation.disk.watermark.flood_stage": null
  }
}

How `cluster.routing.rebalance.enable` and watermark are related

The cluster.routing.rebalance.enable parameter controls planned balancing, which:

distributes shards evenly across nodes for optimal performance
is triggered when nodes are added or removed, or when node weights change
does not affect emergency disk protection mechanisms

If cluster.routing.rebalance.enable: none is set, the system still moves shards off a full disk when the high watermark is exceeded. This parameter does not prevent shard movement. To do that, you would need to change the thresholds themselves (watermark.high) or disable disk monitoring through cluster.routing.allocation.disk.threshold_enabled: false (which is categorically not recommended).
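The relationship between the two mechanisms can be written as a small decision sketch in Python. It is a simplified model of the behavior described in this section, not the cluster's allocation logic, and the function name and reason labels are assumptions.

```python
# Simplified model: which mechanisms may still move shards for a given
# combination of settings. Labels and function name are hypothetical.

REBALANCE_VALUES = {"all", "primaries", "replicas", "none"}


def shard_move_reasons(rebalance_enable, threshold_enabled, over_high_watermark):
    """Return the set of reasons for which shards may be moved."""
    if rebalance_enable not in REBALANCE_VALUES:
        raise ValueError(f"unknown rebalance value: {rebalance_enable!r}")
    reasons = set()
    if rebalance_enable != "none":
        reasons.add("planned rebalancing")
    # Disk protection is independent of the rebalance setting.
    if threshold_enabled and over_high_watermark:
        reasons.add("disk watermark evacuation")
    return reasons


if __name__ == "__main__":
    print(shard_move_reasons("none", True, True))
    print(shard_move_reasons("none", False, True))
```

With rebalancing set to none, the first call still reports disk watermark evacuation. Only disabling the disk threshold, as in the second call, leaves no reason for movement.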
