Disk Watermark Configuration
Overview
The Disk Watermark mechanism is a three-level protection system designed to:
- preventively limit the allocation of new shards to nodes
- initiate shard redistribution
- completely block writes in critical situations
The mechanism's goal is to preserve the integrity of existing indices when disk space is running out.
Mechanism Architecture
The cluster.routing.allocation.disk.watermark.* settings are defined at the cluster level. This is a single, global rule stored in cluster settings (cluster state) and applied uniformly to all data nodes. You cannot set different fill percentages for different nodes using these parameters.
Thresholds are evaluated separately for each node. The master node periodically collects disk usage statistics from every data node and compares the state of each node's local data disk with the unified cluster setting. The resulting actions (refusing to place new shards, moving shards, transitioning indices to read-only) therefore depend on the metrics of the individual node that crossed the threshold.
All requests can be executed in the Developer Console (Main Menu - System Parameters - Developer Console) or sent with curl, for example:
curl -k -X PUT "https://<opensearch_host>:9200/_cluster/settings" \
-H 'Content-Type: application/json' \
-u 'admin:admin' \
-d '{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": "80%",
"cluster.routing.allocation.disk.watermark.high": "85%",
"cluster.routing.allocation.disk.watermark.flood_stage": "90%"
}
}'
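For automation, the same request can be assembled in a script. The following Python sketch only builds the request object and does not send it. The function name and the host opensearch.example are hypothetical, and the credentials are copied from the curl example above.

```python
import base64
import json
import urllib.request


def watermark_request(host, user, password, low, high, flood_stage):
    """Build (but do not send) a PUT _cluster/settings request."""
    body = {
        "persistent": {
            "cluster.routing.allocation.disk.watermark.low": low,
            "cluster.routing.allocation.disk.watermark.high": high,
            "cluster.routing.allocation.disk.watermark.flood_stage": flood_stage,
        }
    }
    # Equivalent of curl's -u option: HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"https://{host}:9200/_cluster/settings",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="PUT",
    )


# Hypothetical host name; values match the curl example.
req = watermark_request("opensearch.example", "admin", "admin", "80%", "85%", "90%")
print(req.get_method(), req.full_url)
```

Sending the request (for example with urllib.request.urlopen) would additionally require TLS handling that matches the environment, which curl's -k option skips.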
Three-Level Cluster Protection System Against Disk Overflow
Level 1: low watermark
Parameter: cluster.routing.allocation.disk.watermark.low
Default value: 85%
Behavior:
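- the node is excluded from candidates for new shard placement
- the master node stops assigning replicas and relocated shards to this node; primary shards of newly created indices can still be placed on it
- writes to shards already located on the node continue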
Purpose: Warning administrators about approaching critical disk fill levels.
Level 2: high watermark
Parameter: cluster.routing.allocation.disk.watermark.high
Default value: 90%
Behavior:
- master node initiates shard movement from overloaded nodes
- when Shard Allocation Awareness is enabled, failure zones are considered
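Critical nuance: Movement is only possible if another node can accept the shard, that is, it satisfies the allocation rules and has enough free disk space. If no such node exists, the shard stays on the overloaded node and the disk keeps filling up.
Risks: The relocation process consumes significant resources (IOPS, network, CPU).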
Shard Allocation Awareness is a mechanism that distributes primary shards and their replicas across different failure zones, so that the copies of a shard are not lost together with a single physical device or zone.
How relocation works when Shard Allocation Awareness is enabled:
If an administrator sets, for example, cluster.routing.allocation.awareness.attributes: zone, and the nodes are configured with node.attr.zone: zoneA or node.attr.zone: zoneB, the cluster keeps copies of the same shard in different zones (zoneA and zoneB). When the high threshold is triggered on a node in zoneA, the cluster cannot simply move the affected shard copy to a node in zoneB, because zoneB already holds the other copy. It has to find another node in zoneA that meets the allocation conditions. This may slow down evacuation or make it temporarily impossible if there is insufficient space in zoneA.
Level 3: flood stage
Parameter: cluster.routing.allocation.disk.watermark.flood_stage
Default value: 95%
Behavior:
- all indices with shards on the affected node are switched to read_only_allow_delete mode
- operations requiring disk space are prohibited
- only data deletion is allowed
Purpose: Preventing damage to existing shards when disk space is exhausted.
The index.blocks.read_only_allow_delete block is set directly in the index metadata stored in the cluster state. As a result, the write restriction applies to every shard of the index on every node, regardless of the fill level of those nodes' local disks. Even if replicas are located on healthy nodes that have not reached any threshold, writing to the index stops completely until the block is removed.
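The three levels and the index-wide effect of the flood-stage block can be summarised in a small model. This is an illustrative Python sketch, not OpenSearch code: the function names, node names and shard layout are hypothetical, and the thresholds are the default percentages listed above.

```python
# Simplified model of the three watermark levels (default percentages).
LOW, HIGH, FLOOD_STAGE = 85.0, 90.0, 95.0


def watermark_level(used_percent: float) -> str:
    """Return the highest watermark that the given disk usage exceeds."""
    if used_percent > FLOOD_STAGE:
        return "flood_stage"
    if used_percent > HIGH:
        return "high"
    if used_percent > LOW:
        return "low"
    return "ok"


def blocked_indices(node_usage: dict, shards_on_node: dict) -> set:
    """Indices that receive read_only_allow_delete: every index with at
    least one shard on a node that is above the flood stage."""
    blocked = set()
    for node, used in node_usage.items():
        if watermark_level(used) == "flood_stage":
            blocked |= shards_on_node.get(node, set())
    return blocked


# Hypothetical cluster: logs-1 has shards on both nodes, metrics-1 only on node-b.
usage = {"node-a": 96.5, "node-b": 60.0}
layout = {"node-a": {"logs-1"}, "node-b": {"logs-1", "metrics-1"}}

print(watermark_level(92.0))           # high
print(blocked_indices(usage, layout))  # {'logs-1'}
```

In this model logs-1 is blocked as a whole although node-b is far below every threshold, while metrics-1, which has no shard on node-a, stays writable.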
Check current cluster settings:
GET _cluster/settings?include_defaults=true
Apply required disk watermark settings:
PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": "85%",
"cluster.routing.allocation.disk.watermark.high": "90%",
"cluster.routing.allocation.disk.watermark.flood_stage": "95%"
}
}
Risks of Using Percentage-Based Threshold Values
Using the default percentage values (85%, 90%, 95%) carries a risk: relative values do not reflect the absolute amount of free space that remains on the device when the protective mechanisms are triggered. As a result, a shard can fail before the configured threshold is reached.
Example:
A Search Anywhere Framework server with a partition capacity of 2 TB:
- with the flood_stage threshold set to 95%, write blocking is activated only after the free space decreases to 100 GB
- meanwhile, a shard of 50 GB may be located on the node, for which a forced segment merge operation is planned. During this operation, the Apache Lucene library temporarily reserves disk space comparable to the size of the merged shard
- due to this discrepancy, there is a risk of emergency write termination with a "No space left on device" error and, consequently, shard damage before disk usage reaches 95% and the protective block is applied
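The arithmetic behind this example can be checked with a short calculation. The sketch below is only an illustration: it assumes decimal units (1 TB = 1000 GB), and the helper name is hypothetical.

```python
def free_space_at_threshold(total_gb: float, used_percent: float) -> float:
    """Free space (GB) left on a disk when usage reaches the given percentage."""
    return total_gb * (100.0 - used_percent) / 100.0


total_gb = 2000.0  # the 2 TB partition from the example
for name, pct in (("low", 85), ("high", 90), ("flood_stage", 95)):
    print(f"{name}: {free_space_at_threshold(total_gb, pct):.0f} GB free")
# low: 300 GB free
# high: 200 GB free
# flood_stage: 100 GB free
```

The reserve left at flood_stage has to cover every disk-hungry background operation running on the node at that moment, not just one merge.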
Configuration Recommendation
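For large-capacity partitions, it is advisable to use absolute threshold values (for example, mb or gb). Absolute values define the minimum amount of free space that must remain on the disk, so low takes the largest value and flood_stage the smallest. All three thresholds must use the same form: percentages and absolute values cannot be mixed. This approach guarantees a pre-calculated reserve of free space sufficient for the correct execution of resource-intensive background operations on large shards.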
Example configuration with absolute values:
PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": "200gb",
"cluster.routing.allocation.disk.watermark.high": "100gb",
"cluster.routing.allocation.disk.watermark.flood_stage": "50gb"
}
}
When setting absolute values, the following condition must be observed: the low threshold value should not exceed the physical capacity of the smallest disk device in the cluster. Failure to observe this condition will result in the node with the minimum disk size being excluded from the allocation process and completely stopping shard acceptance.
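These constraints can be checked before the settings are applied. The following Python sketch is an illustration under stated assumptions: the function and its parameters are hypothetical, and all sizes are given in gigabytes.

```python
def check_absolute_watermarks(low_gb, high_gb, flood_gb, smallest_disk_gb):
    """Return a list of problems found in absolute (free-space) watermarks."""
    problems = []
    # Absolute values mean "minimum free space", so they must decrease
    # from low to flood_stage.
    if not (low_gb >= high_gb >= flood_gb):
        problems.append("expected low >= high >= flood_stage")
    # A node whose whole disk is not larger than the low value can never
    # have enough free space to accept shards.
    if low_gb >= smallest_disk_gb:
        problems.append("low watermark is not smaller than the smallest disk")
    return problems


print(check_absolute_watermarks(200, 100, 50, smallest_disk_gb=500))  # []
print(check_absolute_watermarks(200, 100, 50, smallest_disk_gb=150))
```

The first call uses the values from the example configuration and reports no problems; the second shows what happens when the cluster contains a hypothetical 150 GB disk.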
In a cluster with disks of different sizes (e.g., 500 GB and 3 TB), percentage thresholds leave very different reserves: 10% of 500 GB is 50 GB, while 10% of 3 TB is 300 GB, so balancing works unpredictably. Absolute values are recommended in this case. In general, data nodes in a Search Anywhere Framework cluster should have identical disk storage sizes.
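The effect of mixed disk sizes can be made concrete with a small calculation. In this Python sketch the helper name is hypothetical, the disk sizes are the ones from the paragraph above (decimal units assumed), and the 100 GB reserve is an arbitrary illustrative value.

```python
def used_percent_for_free_gb(total_gb: float, free_gb: float) -> float:
    """Disk usage (%) at which only free_gb of free space remains."""
    return 100.0 * (total_gb - free_gb) / total_gb


for total_gb in (500.0, 3000.0):
    free_at_90 = total_gb * 10.0 / 100.0
    pct = used_percent_for_free_gb(total_gb, 100.0)
    print(f"{total_gb:.0f} GB disk: 90% used leaves {free_at_90:.0f} GB free; "
          f"a 100 GB reserve corresponds to {pct:.1f}% used")
```

A percentage threshold gives each disk a different reserve, and an absolute threshold triggers at a different fill level on each disk, which is why identical disk sizes are the simplest option.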
Difference between transient and persistent in Search Anywhere Framework Cluster Settings
When configuring a cluster, settings can be applied in one of two sections:
transient (temporary settings):
- applied immediately but not saved after cluster restart
- have priority over persistent: if the same parameter is set in both sections, the value from transient is used
- suitable for operational changes, testing, temporary problem resolution
- reset to the persistent value (or to the default value if no persistent value is set) after a full cluster restart
persistent (permanent settings):
- saved in the cluster state and preserved after node and cluster restarts
- act permanently until explicitly changed or reset (set to null)
- applied immediately but can be overridden by transient settings
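The precedence rules can be expressed as a short lookup. This is a simplified Python model for illustration only; the function name and the dictionaries are hypothetical and do not represent the product's API.

```python
KEY = "cluster.routing.allocation.disk.watermark.flood_stage"
DEFAULTS = {KEY: "95%"}


def effective_setting(key, transient, persistent, defaults=DEFAULTS):
    """Resolve a setting: transient first, then persistent, then the default."""
    for source in (transient, persistent, defaults):
        if source.get(key) is not None:
            return source[key]
    return None


persistent = {KEY: "90%"}
transient = {KEY: "98%"}

print(effective_setting(KEY, transient, persistent))  # 98%
print(effective_setting(KEY, {}, persistent))         # 90%
print(effective_setting(KEY, {}, {}))                 # 95%
```

The second call corresponds to the situation after a full cluster restart, when transient values are gone and the persistent value takes effect again.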
Incident Response Actions
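When flood_stage is triggered, the cluster switches the affected indices to read-only mode. The main thing to consider when resolving the issue is that the block is not removed as soon as the disk is expanded or data is deleted: it is released automatically only after disk usage on the node drops below the high watermark, and otherwise has to be removed manually.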
1. Diagnostics
Check cluster status:
GET _cluster/health?pretty
Possible values for the status parameter:
- green: all shards are allocated
- yellow: some replicas are not allocated
- red: some primary shards are not allocated, data is unavailable
Check disk and shard status on nodes:
GET _cat/allocation?v&h=node,disk.percent,disk.used,disk.avail,disk.total,shards
Example command output (the disk.percent column is omitted here):
node disk.used disk.avail disk.total shards
smos-node-02 251gb 725gb 976gb 1953
smos-node-00 276.4gb 671.4gb 947.9gb 1952
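When the output is collected by a script, the fill percentage of each node can be derived from the disk.used and disk.total columns. The Python sketch below parses the sample output shown above. The helper names are hypothetical, and rows without disk values (for example, an UNASSIGNED row) are not handled.

```python
SAMPLE = """\
node         disk.used disk.avail disk.total shards
smos-node-02 251gb     725gb      976gb      1953
smos-node-00 276.4gb   671.4gb    947.9gb    1952
"""

# Conversion factors to gigabytes (binary multiples).
UNITS = {"b": 1 / 1024**3, "kb": 1 / 1024**2, "mb": 1 / 1024, "gb": 1.0, "tb": 1024.0}


def to_gb(value: str) -> float:
    """Convert a _cat size such as '276.4gb' to gigabytes."""
    for unit in ("kb", "mb", "gb", "tb", "b"):  # two-letter units first
        if value.endswith(unit):
            return float(value[: -len(unit)]) * UNITS[unit]
    raise ValueError(f"unrecognised size: {value}")


def used_percent_by_node(cat_output: str) -> dict:
    """Parse whitespace-separated _cat/allocation output into {node: used %}."""
    lines = cat_output.strip().splitlines()
    header = lines[0].split()
    result = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        used, total = to_gb(row["disk.used"]), to_gb(row["disk.total"])
        result[row["node"]] = round(100.0 * used / total, 1)
    return result


print(used_percent_by_node(SAMPLE))
# {'smos-node-02': 25.7, 'smos-node-00': 29.2}
```

Both nodes in the sample are far below the default low watermark of 85%.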
Find indices occupying the most space:
GET _cat/indices?v&h=index,store.size&s=store.size:desc
If the data in an index is not critical, the index can be deleted:
DELETE /<index-name>
DELETE /sm_servers_hosts_uptime-000405
DELETE /sm_servers_hosts_uptime-2025*
2. Freeing Physical Space
The primary task is to eliminate the root cause: either physically expand disk space (via LVM, cloud disk) or clean up old indices/documents to reduce disk fill below the flood_stage threshold.
Temporary measure if there's no immediate possibility to expand the disk:
Raise the flood_stage threshold through transient settings (will reset after cluster restart):
PUT _cluster/settings
{
"transient": {
"cluster.routing.allocation.disk.watermark.flood_stage": "98%"
}
}
However, with a high data flow and large shard sizes, this will not help: the additional space is consumed almost immediately.
3. Forced Block Removal
Execute a request to remove the read_only_allow_delete flag from all cluster indices:
PUT */_settings
{
"index.blocks.read_only_allow_delete": null
}
To remove the block from a specific index:
PUT /<index-name>/_settings
{
"index.blocks.read_only_allow_delete": null
}
Check for blocked indices:
GET _all/_settings?filter_path=**.blocks*
In the command output, a blocked index has the parameter:
"blocks": {
"read_only_allow_delete": "true"
}
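With many indices, the response is easier to process in a script. The following Python sketch extracts the names of blocked indices from an already parsed JSON response. The function name and the index names in the sample are hypothetical; the nesting (index name, settings, index, blocks) is assumed to match the response of the request above.

```python
def blocked_index_names(settings_response: dict) -> list:
    """Return indices whose settings contain read_only_allow_delete = true."""
    blocked = []
    for index_name, body in settings_response.items():
        blocks = body.get("settings", {}).get("index", {}).get("blocks", {})
        if str(blocks.get("read_only_allow_delete", "false")).lower() == "true":
            blocked.append(index_name)
    return sorted(blocked)


# Hypothetical response; the index names are made up for illustration.
response = {
    "example-index-000001": {
        "settings": {"index": {"blocks": {"read_only_allow_delete": "true"}}}
    },
    "example-index-000002": {
        "settings": {"index": {"blocks": {"write": "true"}}}
    },
}

print(blocked_index_names(response))  # ['example-index-000001']
```

An empty list means that no index currently carries the read_only_allow_delete block.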
4. Restoring Stuck Shards
After removing the block, check the cluster status again:
GET _cluster/health?pretty
Cluster recovery may take a significant amount of time.
If the status remains red or yellow and the number of unassigned shards has stopped decreasing, allocation attempts have failed.
Check unassigned shard status:
GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state
Check allocation status:
GET _cluster/allocation/explain?pretty
To diagnose a shard of a specific index <index-name>:
GET _cluster/allocation/explain
{
"index": "<index-name>",
"shard": 0,
"primary": true
}
Initiate forced redistribution. The request performs one retry of the failed allocations and can be repeated if necessary:
POST _cluster/reroute?retry_failed=true
5. Accelerating Recovery (Optional)
By default, disk usage information for the nodes is refreshed every 30 seconds. To make the master node "see" the freed space sooner and lift the restriction from the node, you can temporarily reduce the interval:
PUT _cluster/settings
{
"transient": {
"cluster.info.update.interval": "15s"
}
}
Additionally, you can increase limits on simultaneous shard recoveries:
PUT _cluster/settings
{
"transient": {
"cluster.routing.allocation.node_concurrent_incoming_recoveries": 4,
"cluster.routing.allocation.node_concurrent_outgoing_recoveries": 4,
"indices.recovery.max_bytes_per_sec": "150mb"
}
}
After completing recovery work, be sure to return these settings to their default values to avoid creating excessive load on the cluster:
PUT _cluster/settings
{
"transient": {
"cluster.info.update.interval": null,
"cluster.routing.allocation.node_concurrent_incoming_recoveries": null,
"cluster.routing.allocation.node_concurrent_outgoing_recoveries": null,
"indices.recovery.max_bytes_per_sec": null
}
}
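A reset body can also be generated from a record of what was changed, which helps to avoid forgetting a key. Note that a value has to be reset in the same section (transient or persistent) in which it was set. The Python sketch below is illustrative; the function name and the way the applied settings are recorded are assumptions.

```python
import json


def reset_payload(applied: dict) -> dict:
    """Build a _cluster/settings body that sets every applied key to null."""
    return {
        section: {key: None for key in settings}
        for section, settings in applied.items()
    }


# Recovery limits from the example above, grouped by the section they were set in.
applied = {
    "transient": {
        "cluster.routing.allocation.node_concurrent_incoming_recoveries": 4,
        "cluster.routing.allocation.node_concurrent_outgoing_recoveries": 4,
        "indices.recovery.max_bytes_per_sec": "150mb",
    }
}

# json.dumps renders Python's None as JSON null.
print(json.dumps(reset_payload(applied), indent=2))
```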
Resetting disk watermark parameters to default values
If a configuration change did not work out, you can always roll back to the default values.
Settings made through persistent are preserved even after a cluster restart. However, if the same parameters were previously changed through transient, the transient values have higher priority, and resetting persistent to null has no visible effect while they are active. The request below therefore resets both sections:
PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": null,
"cluster.routing.allocation.disk.watermark.high": null,
"cluster.routing.allocation.disk.watermark.flood_stage": null
},
"transient": {
"cluster.routing.allocation.disk.watermark.low": null,
"cluster.routing.allocation.disk.watermark.high": null,
"cluster.routing.allocation.disk.watermark.flood_stage": null
}
}
How cluster.routing.rebalance.enable and watermark are related
The cluster.routing.rebalance.enable parameter controls planned balancing only:
- planned balancing is the even distribution of shards across nodes for optimal performance
- it is triggered when nodes are added or removed or when node weights change
- the parameter does not affect emergency disk protection mechanisms
If cluster.routing.rebalance.enable: none is set, the system will still move shards away from a full disk when the high watermark is exceeded. This parameter is not a way to prevent shard movement. To do that, you would need to change the thresholds themselves (watermark.high) or disable disk monitoring through cluster.routing.allocation.disk.threshold_enabled: false (which is categorically not recommended).