Version: 6.0

Disk Watermark Configuration

Overview

The Disk Watermark mechanism is a three-level protection system designed to:

  • preventively limit data writing to nodes
  • initiate shard redistribution
  • completely block writes in critical situations

The mechanism's goal is to preserve the integrity of existing indices when disk space is exhausted.


Mechanism Architecture

The cluster.routing.allocation.disk.watermark.* settings are defined at the cluster level. This is a single, global rule stored in cluster settings (cluster state) and applied uniformly to all data nodes. You cannot set different fill percentages for different nodes using these parameters.

Disk usage is evaluated separately for each data node. Nodes report the state of their local data disks, and the master node compares each node's usage with the cluster-wide thresholds at a regular interval (every 30 seconds by default). The resulting action (refusing to place new shards, moving shards away, switching indices to read-only) therefore depends only on the metrics of the affected node, not on the average fill level of the cluster.
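
The threshold comparison described above can be sketched as follows. This is a simplified illustration, not the actual allocation code: the Watermarks class and the classify_disk function are hypothetical names, the defaults are the percentages quoted in this article, and the handling of values exactly on a threshold is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Watermarks:
    """Percentage thresholds; defaults are the values quoted in this article."""
    low: float = 85.0
    high: float = 90.0
    flood_stage: float = 95.0


def classify_disk(used_bytes: int, total_bytes: int,
                  wm: Watermarks = Watermarks()) -> str:
    """Return the highest watermark level exceeded by one node's disk usage."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_percent = 100.0 * used_bytes / total_bytes
    # Boundary handling (strictly greater than) is an assumption of this sketch.
    if used_percent > wm.flood_stage:
        return "flood_stage"   # indices on the node become read-only
    if used_percent > wm.high:
        return "high"          # shards are moved away from the node
    if used_percent > wm.low:
        return "low"           # no new shards are placed on the node
    return "ok"
```

Each node is classified on its own figures, so one node can be at the high level while the rest of the cluster is still below low.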

The requests in this article can be run from the Developer Console (Main Menu - System Parameters - Developer Console) or sent with curl.

Request example
curl -k -X PUT "https://<opensearch_host>:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -u 'admin:admin' \
  -d '{
    "persistent": {
      "cluster.routing.allocation.disk.watermark.low": "80%",
      "cluster.routing.allocation.disk.watermark.high": "85%",
      "cluster.routing.allocation.disk.watermark.flood_stage": "90%"
    }
  }'
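
For scripted changes, the same request can be prepared in Python with the standard library only. This is a minimal sketch: build_watermark_request is a hypothetical helper, and the host and credentials passed to it are placeholders, as in the curl example.

```python
import base64
import json
import urllib.request


def build_watermark_request(base_url: str, user: str, password: str,
                            low: str, high: str,
                            flood_stage: str) -> urllib.request.Request:
    """Build (but do not send) a PUT _cluster/settings request."""
    body = {
        "persistent": {
            "cluster.routing.allocation.disk.watermark.low": low,
            "cluster.routing.allocation.disk.watermark.high": high,
            "cluster.routing.allocation.disk.watermark.flood_stage": flood_stage,
        }
    }
    # HTTP basic authentication, equivalent to curl's -u option.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )
```

The request can then be passed to urllib.request.urlopen. The equivalent of curl's -k option is an ssl context with certificate verification disabled, which is appropriate only in test environments.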

Three-Level Cluster Protection System Against Disk Overflow

Parameter: cluster.routing.allocation.disk.watermark.low

Default value: 85%

Behavior:

  • node is excluded from candidates for new shard placement
  • master node stops allocating new shards to this node, except for primary shards of newly created indices
  • existing writes continue

Purpose: Warning administrators about approaching critical disk fill levels.

Parameter: cluster.routing.allocation.disk.watermark.high

Default value: 90%

Behavior:

  • the cluster attempts to move shards from this node to nodes with more free space
  • new shards are not allocated to this node

Purpose: Freeing disk space on the node before writes have to be blocked.

Parameter: cluster.routing.allocation.disk.watermark.flood_stage

Default value: 95%

Behavior:

  • every index that has a shard on this node is switched to read-only mode (index.blocks.read_only_allow_delete)
  • writes to these indices are rejected, while reads and index deletion remain available

Purpose: Last-resort protection of existing data when the disk is almost full.

Check current cluster settings:

GET _cluster/settings?include_defaults=true

Apply required disk watermark settings:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}

Risks of Using Percentage-Based Threshold Values

The default percentage thresholds (85%, 90%, 95%) carry a risk: a relative value says nothing about the absolute amount of free space left on the device when a protective mechanism is triggered. On a large disk a few percent can amount to hundreds of gigabytes, on a small one only a few gigabytes. As a result, a resource-intensive operation can exhaust the remaining space and damage a shard before the configured threshold is reached.

Example:

A Search Anywhere Framework server has a data partition of 2 TB.

  • with the flood_stage threshold set to 95%, write blocking is activated only after free space drops to about 100 GB
  • meanwhile, the node may hold a 50 GB shard for which a forced segment merge is planned. During this operation, the Apache Lucene library temporarily needs disk space comparable to the size of the merged shard
  • if such operations coincide with continued indexing, the remaining space can run out faster than disk usage is re-checked. Writes then fail with a No space left on device error, and the shard can be damaged before the 95% threshold is detected and the protective block is applied
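
The arithmetic behind this example can be checked with a short sketch. Both function names are hypothetical, 2 TB is treated as 2000 GB for simplicity, and the idea that a merge needs roughly one extra copy of the shard is the approximation used in the example above.

```python
def free_space_at_threshold(total_gb: float, used_percent: float) -> float:
    """Free space (GB) left on a disk when a percentage watermark is reached."""
    return total_gb * (100.0 - used_percent) / 100.0


def headroom_after_merges(free_gb: float, merging_shards_gb: list[float]) -> float:
    """Free space (GB) left if each merging shard temporarily needs about
    its own size in extra disk space. A negative result means the disk
    would fill up before the merges finish."""
    return free_gb - sum(merging_shards_gb)


# The 2 TB partition from the example, with flood_stage at 95%.
reserve = free_space_at_threshold(2000, 95)            # 100.0 GB
one_merge = headroom_after_merges(reserve, [50.0])     # 50.0 GB left
three_merges = headroom_after_merges(reserve, [50.0, 50.0, 30.0])  # -30.0 GB
```

A single 50 GB merge still fits into the 100 GB reserve, but several concurrent merges (a hypothetical scenario here) or ongoing indexing can consume it entirely.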

Configuration Recommendation

For large-capacity partitions, it is advisable to use absolute threshold values (mb, gb). An absolute value defines the minimum amount of free space that must remain on the disk, so the values decrease from low to flood_stage. This approach keeps a pre-calculated reserve of free space that is sufficient for resource-intensive background operations on large shards.

Example configuration with absolute values:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "200gb",
    "cluster.routing.allocation.disk.watermark.high": "100gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "50gb"
  }
}

When setting absolute values, the following condition must be observed: the low threshold value should not exceed the physical capacity of the smallest disk device in the cluster. Failure to observe this condition will result in the node with the minimum disk size being excluded from the allocation process and completely stopping shard acceptance.
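
A planned set of absolute values can be sanity-checked before it is applied. The sketch below is a hypothetical helper, not part of the product. It assumes that sizes are written like "200gb" and that absolute thresholds express free space, so they must decrease from low to flood_stage.

```python
import re

# Binary multiples are assumed for the unit suffixes.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}


def parse_size(value: str) -> int:
    """Convert a size string such as '200gb' into bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(b|kb|mb|gb|tb)\s*", value.lower())
    if match is None:
        raise ValueError(f"unrecognised size: {value!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])


def check_absolute_watermarks(low: str, high: str, flood_stage: str,
                              smallest_disk: str) -> list[str]:
    """Return a list of problems; an empty list means the values look consistent."""
    low_b, high_b, flood_b = (parse_size(v) for v in (low, high, flood_stage))
    problems = []
    if not (low_b > high_b > flood_b):
        problems.append("free-space thresholds must satisfy low > high > flood_stage")
    if low_b >= parse_size(smallest_disk):
        problems.append("low threshold is not smaller than the smallest disk")
    return problems
```

For the configuration above and a smallest disk of 500gb, the function returns an empty list. With a smallest disk of 150gb, it reports that the low threshold is too large.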

note

In a cluster with disks of different sizes (for example, 500 GB and 3 TB), percentage thresholds leave very different reserves: 10% of 500 GB is 50 GB, while 10% of 3 TB is 300 GB. Shard balancing then becomes hard to predict, so absolute values are recommended. In general, data nodes in a Search Anywhere Framework cluster should have identical disk storage sizes.

Difference between transient and persistent in Search Anywhere Framework Cluster Settings

Cluster settings can be placed in one of two sections of the request:

transient (temporary settings):

  • applied immediately but not preserved after a full cluster restart
  • have priority over persistent: if the same parameter is set in both sections, the value from transient is used
  • suitable for operational changes, testing, temporary problem resolution
  • after a full cluster restart, the persistent value applies again or, if none is set, the default value

persistent (permanent settings):

  • saved in the cluster state and preserved after node and cluster restarts
  • act permanently until explicitly changed or reset (set to null)
  • applied immediately but can be overridden by transient settings
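
The priority rules above can be expressed as a small lookup. This is an illustrative sketch: resolve_setting is a hypothetical function, and it assumes flat key maps such as those returned by GET _cluster/settings?include_defaults=true&flat_settings=true.

```python
def resolve_setting(key: str, defaults: dict, persistent: dict,
                    transient: dict) -> tuple[str, str]:
    """Return (value, source) for a setting, checking transient first,
    then persistent, then the built-in defaults."""
    for source_name, source in (("transient", transient),
                                ("persistent", persistent),
                                ("default", defaults)):
        if key in source:
            return source[key], source_name
    raise KeyError(key)


KEY = "cluster.routing.allocation.disk.watermark.flood_stage"
defaults = {KEY: "95%"}
persistent = {KEY: "50gb"}
transient = {KEY: "98%"}

value, source = resolve_setting(KEY, defaults, persistent, transient)
# The transient value wins while it is set. After it is reset to null,
# the persistent value applies again.
```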

Incident Response Actions

When flood_stage is triggered, the cluster switches indices to read-only mode. The main thing to consider when resolving the issue is that the block is not automatically removed when expanding the disk or deleting data.

1. Diagnostics

Check cluster status:

GET _cluster/health?pretty

Possible values for the status parameter:

  • green — all shards are distributed
  • yellow — some replicas are not distributed
  • red — some primary shards are not distributed, data is unavailable

Check disk and shard status on nodes:

GET _cat/allocation?v&h=node,disk.percent,disk.used,disk.avail,disk.total,shards

Example command output (the disk.percent column is omitted here):

node          disk.used  disk.avail  disk.total  shards
smos-node-02  251gb      725gb       976gb       1953
smos-node-00  276.4gb    671.4gb     947.9gb     1952
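
The used percentage can also be derived from the disk.used and disk.total columns of such output, which makes it easy to compare with the watermark thresholds. The sketch below is a hypothetical helper that parses the text output shown above. It assumes whitespace-separated columns and a header row, and it skips rows with missing cells.

```python
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3,
          "tb": 1024**4, "pb": 1024**5}


def size_to_bytes(text: str) -> float:
    """Convert a human-readable size such as '276.4gb' into bytes."""
    text = text.strip().lower()
    # Check two-letter suffixes before the bare 'b' suffix.
    for suffix in ("kb", "mb", "gb", "tb", "pb", "b"):
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * _UNITS[suffix]
    raise ValueError(f"unrecognised size: {text!r}")


def used_percent_by_node(cat_output: str) -> dict[str, float]:
    """Map node name to used disk percentage, rounded to one decimal."""
    lines = [line.split() for line in cat_output.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    node_i = header.index("node")
    used_i = header.index("disk.used")
    total_i = header.index("disk.total")
    result = {}
    for row in rows:
        if len(row) != len(header):
            continue  # incomplete row, for example one with empty disk cells
        used = size_to_bytes(row[used_i])
        total = size_to_bytes(row[total_i])
        result[row[node_i]] = round(100 * used / total, 1)
    return result
```

For the example output above, both nodes are at roughly 26% and 29% usage, far below the low watermark.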

Find indices occupying the most space:

GET _cat/indices?v&h=index,store.size&s=store.size:desc

Indices whose data is not critical can be deleted:

DELETE /<index-name>
DELETE /sm_servers_hosts_uptime-000405
DELETE /sm_servers_hosts_uptime-2025*

2. Freeing Physical Space

The primary task is to eliminate the root cause: either physically expand disk space (via LVM, cloud disk) or clean up old indices/documents to reduce disk fill below the flood_stage threshold.

Temporary measure if there's no immediate possibility to expand the disk:

Raise the flood_stage threshold through transient settings (will reset after cluster restart):

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.flood_stage": "98%"
  }
}

With a high ingestion rate and large shards, this measure is unlikely to help, because the additional space fills up quickly.

3. Forced Block Removal

Execute a request to remove the read_only_allow_delete flag from all cluster indices:

PUT */_settings
{
  "index.blocks.read_only_allow_delete": null
}

To remove the block from a specific index:

PUT /<index-name>/_settings
{
  "index.blocks.read_only_allow_delete": null
}

Check for blocked indices:

GET _all/_settings?filter_path=**.blocks*

In the command output, a blocked index has the parameter:

"blocks": {
  "read_only_allow_delete": "true"
}
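
When the cluster holds many indices, the blocked ones can be picked out of the settings response programmatically. This is a minimal sketch with a hypothetical function name. It assumes the response of GET _all/_settings has already been parsed from JSON into a dictionary keyed by index name.

```python
def blocked_indices(settings_response: dict) -> list[str]:
    """Return the names of indices that carry the read_only_allow_delete block."""
    blocked = []
    for index_name, body in settings_response.items():
        blocks = body.get("settings", {}).get("index", {}).get("blocks", {})
        # The flag is returned as the string "true"; accept a boolean as well.
        if str(blocks.get("read_only_allow_delete", "false")).lower() == "true":
            blocked.append(index_name)
    return sorted(blocked)
```

An empty result after the block removal request confirms that no index is left in read-only mode.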

4. Restoring Stuck Shards

After removing the block, check the cluster status again:

GET _cluster/health?pretty

Cluster recovery may take a significant amount of time.

If the status remains red or yellow and the number of unassigned shards has stopped decreasing, allocation attempts have failed.

Check unassigned shard status:

GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state
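
With many unassigned shards, it helps to group them by reason before looking at individual shards. The sketch below uses a hypothetical function name and assumes the same request was sent with format=json added, so that each row is a dictionary with the state and unassigned.reason keys.

```python
from collections import Counter


def unassigned_reasons(shard_rows: list[dict]) -> Counter:
    """Count unassigned shards per unassigned.reason value."""
    return Counter(
        row.get("unassigned.reason") or "UNKNOWN"
        for row in shard_rows
        if row.get("state") == "UNASSIGNED"
    )
```

Calling most_common() on the result shows the dominant reason first, which indicates where to start with the allocation explain request.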

Check allocation status:

GET _cluster/allocation/explain?pretty

To diagnose a shard of a specific index <index-name>:

GET _cluster/allocation/explain
{
  "index": "<index-name>",
  "shard": 0,
  "primary": true
}

Initiate forced redistribution. The request makes a single attempt to allocate the failed shards and can be repeated if necessary:

POST _cluster/reroute?retry_failed=true

5. Accelerating Recovery (Optional)

By default, disk usage information is refreshed every 30 seconds. To let the master node notice the freed space sooner and lift the restriction from the node, you can temporarily reduce the interval:

PUT _cluster/settings
{
  "transient": {
    "cluster.info.update.interval": "15s"
  }
}

Additionally, you can increase limits on simultaneous shard recoveries:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": 4,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": 4,
    "indices.recovery.max_bytes_per_sec": "150mb"
  }
}

After completing recovery work, be sure to return to default values to avoid creating excessive load on the cluster:

PUT _cluster/settings
{
  "transient": {
    "cluster.info.update.interval": null,
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": null,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": null,
    "indices.recovery.max_bytes_per_sec": null
  }
}
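
A reset takes effect only in the section (transient or persistent) where the value was originally set. One way to avoid a mismatch is to derive the reset body from the body that was applied. The helper below is a hypothetical sketch of that idea.

```python
import json


def build_reset_body(applied: dict) -> dict:
    """Mirror an applied _cluster/settings body, replacing every value with
    None (JSON null) so that each setting is reset in its own section."""
    return {
        section: {key: None for key in settings}
        for section, settings in applied.items()
    }


applied = {
    "transient": {
        "cluster.routing.allocation.node_concurrent_incoming_recoveries": 4,
        "indices.recovery.max_bytes_per_sec": "150mb",
    }
}
reset_json = json.dumps(build_reset_body(applied), indent=2)
```

The resulting JSON can be sent with PUT _cluster/settings once recovery has finished.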

Resetting disk watermark parameters to default values

If a configuration change did not work as intended, you can always roll back to the default values.

The examples in this article use persistent settings, which are preserved after a cluster restart. If the same parameters were also changed through transient, the transient values take priority, and resetting only persistent to null has no visible effect while they are active. For this reason, reset both sections:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": null,
    "cluster.routing.allocation.disk.watermark.high": null,
    "cluster.routing.allocation.disk.watermark.flood_stage": null
  },
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": null,
    "cluster.routing.allocation.disk.watermark.high": null,
    "cluster.routing.allocation.disk.watermark.flood_stage": null
  }
}

The cluster.routing.rebalance.enable parameter controls planned balancing only:

  • planned balancing is the even distribution of shards across nodes for optimal performance
  • it is triggered when nodes are added or removed or when node weights change
  • the parameter does not affect the emergency disk protection mechanisms

Even if cluster.routing.rebalance.enable is set to none, the system still moves shards away from a full disk when the high watermark is exceeded. This parameter is therefore not a way to prevent shard movement. To prevent it, you would need to change the thresholds themselves (watermark.high) or disable disk monitoring with cluster.routing.allocation.disk.threshold_enabled: false, which is strongly discouraged.