Reindex data
Reindexing is the process of transferring documents from one Search Anywhere Framework Data Storage index to another using the _reindex API operation.
This operation is used when you need to change the data storage or processing structure without manually downloading and re-uploading documents.
Reindexing does not modify the source index and does not automatically transfer its settings, mappings, aliases or index templates. Create the target index in advance so that the required schema and settings are defined explicitly.
Terms
| Term | Description |
|---|---|
| Source index | Index from which documents are copied |
| Target index | New index to which documents are written |
| Alias | Logical index name. Allows switching the application to a new index without changing its configuration |
| Ingest pipeline | Set of processors applied to documents when writing |
| Slice | Part of the reindexing task. Used for parallel processing of large indexes |
| Task | Background Search Anywhere Framework Data Storage task when _reindex is run asynchronously |
When reindexing is required
- changing field type in mapping
- changing analyzers, normalizers or number of primary shards
- renaming index
- renaming, deleting or transforming fields
- merging multiple indexes into one
- copying part of data by filter
- applying ingest pipeline to already existing documents
Some changes, for example adding a new field to mapping, can be performed without reindexing. Before starting the operation, check whether the planned change really requires copying the data.
Prerequisites
Before starting work, ensure that:
- access to Dev Tools, curl or another HTTP client is available
- the user has rights to read the source index and write to the target
- there is enough space on the disk for temporary storage of two copies of data
- the cluster is in green state or acceptable yellow for maintenance
- a snapshot or other data recovery method has been created (if necessary)
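Part of this list can be checked by a script. The sketch below is only an illustration: preflight_problems is a hypothetical helper, it expects the status value from a GET _cluster/health response together with byte counts you have collected yourself, and it assumes that the copy needs about as much free space as the source index occupies.

```python
def preflight_problems(health_status, source_bytes, free_bytes, allow_yellow=True):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    accepted = {"green", "yellow"} if allow_yellow else {"green"}
    if health_status not in accepted:
        problems.append(f"cluster status is {health_status}")
    # Assumption: the target index needs roughly the size of the source index.
    if free_bytes < source_bytes:
        problems.append("not enough free disk space for a second copy")
    return problems


# Example values are made up for illustration.
print(preflight_problems("yellow", source_bytes=50_000, free_bytes=80_000))
```

Access rights and the existence of a snapshot still have to be confirmed separately.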
General zero-downtime scheme
Recommended scenario for production using alias:
- Create a new index with a new version name, for example orders-v2
- Copy data from the current index via _reindex
- Check the number of documents, sample documents, mapping and search queries
- Stop or limit writing to the old index for a short time
- Resynchronize documents that changed during the main reindexing
- Atomically switch alias from old index to new
- Check the application
- Delete the old index only after observation period
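The resynchronization step depends on being able to find the documents that changed while the main copy was running. A minimal sketch, assuming that every document carries a last-modified timestamp in a hypothetical updated_at field (the helper name resync_body and the index name orders-v1 are illustrative as well): a second _reindex pass selects only documents modified since the first pass started and overwrites their copies in the target index.

```python
import json


def resync_body(source_index, target_index, started_at, timestamp_field="updated_at"):
    """Build a _reindex body that copies only documents changed since started_at."""
    return {
        "source": {
            "index": source_index,
            # Range query on the assumed last-modified field.
            "query": {"range": {timestamp_field: {"gte": started_at}}},
        },
        # No op_type here: the default behaviour overwrites documents that
        # already exist in the target, which a catch-up pass needs.
        "dest": {"index": target_index},
    }


print(json.dumps(resync_body("orders-v1", "orders-v2", "2024-01-01T00:00:00Z"), indent=2))
```

Documents deleted from the source index during the main pass are not detected this way and need a separate strategy.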
Alias switching is performed with a single request to ensure that the alias points to exactly one index at any time:
POST _aliases
{
"actions": [
{ "remove": { "index": "old-index", "alias": "index-alias" } },
{ "add": { "index": "new-index", "alias": "index-alias" } }
]
}
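When the switch is scripted, the same body can be generated from the index names. A minimal sketch; alias_swap_body is a hypothetical helper, not part of any client library, and it only builds the body for POST _aliases.

```python
def alias_swap_body(alias, old_index, new_index):
    """Build the body that moves an alias from old_index to new_index in one request."""
    if old_index == new_index:
        raise ValueError("old and new index must differ")
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


body = alias_swap_body("index-alias", "old-index", "new-index")
print(body)
```

Calling the helper with the index names swapped produces the rollback request.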
Reindexing all documents
1. Creating the target index
First, you need to create the target index with the required field structure (mapping) and settings. They can be set manually or copied from the source index.
When creating temporary indexes, it is recommended to set the number of replicas (number_of_replicas) to 0 during reindexing. After completion, return the original value.
Do not rely on automatic index creation during _reindex - Search Anywhere Framework Data Storage will create it with dynamic mapping, which may not match the required schema.
PUT <index-name>
{
"mappings": {
... // Specify the required mapping
},
"settings": {
... // Specify the required settings
}
}
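If the settings and mappings are copied from the source index, the response of GET <source-index> cannot be sent back unchanged, because it contains settings that the cluster assigns itself. The sketch below assumes the usual response shape (index name at the top level, then aliases, mappings and settings.index) and strips uuid, creation_date, provided_name and version. The helper name is hypothetical, and the list of stripped keys may need to be extended for a particular cluster.

```python
import copy

# Settings assigned by the cluster; they cannot be set when creating an index.
READ_ONLY_SETTINGS = {"uuid", "creation_date", "provided_name", "version"}


def create_body_from(get_index_response, source_index, replicas=0):
    """Turn a GET <index> response into a body for PUT <new-index>."""
    info = get_index_response[source_index]
    index_settings = {
        key: copy.deepcopy(value)
        for key, value in info["settings"]["index"].items()
        if key not in READ_ONLY_SETTINGS
    }
    # Replicas are set to 0 for the duration of the copy and restored afterwards.
    index_settings["number_of_replicas"] = replicas
    return {
        "mappings": copy.deepcopy(info["mappings"]),
        "settings": {"index": index_settings},
    }
```

Aliases are left out on purpose, because they are switched separately once the data has been verified.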
2. Performing the Reindex operation
For small indexes, synchronous launch is suitable:
POST _reindex
{
"source":{
"index":"source"
},
"dest":{
"index":"<index-name>"
}
}
For large indexes, asynchronous launch is required:
POST _reindex?wait_for_completion=false&slices=auto&requests_per_second=1000
{
"source": {
"index": "source",
"size": 1000
},
"dest": {
"index": "<index-name>",
"op_type": "create"
},
"conflicts": "proceed"
}
Launch parameters:
| Parameter | Purpose |
|---|---|
| wait_for_completion=false | Runs the operation in the background and returns task_id |
| slices=auto | Enables automatic parallel execution |
| requests_per_second | Limits speed to reduce cluster load |
| source.size | Batch size for reading documents |
| dest.op_type=create | Does not overwrite existing documents in the target index |
| conflicts=proceed | Continues execution on version conflict or existing _id |
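When the request is issued from a script, the launch parameters can be kept in one place. This is a sketch: reindex_request is a hypothetical helper, and its defaults simply repeat the values from the asynchronous example above.

```python
from urllib.parse import urlencode


def reindex_request(source_index, target_index, requests_per_second=1000,
                    slices="auto", batch_size=1000):
    """Return (path, body) for an asynchronous, throttled _reindex call."""
    params = {
        "wait_for_completion": "false",   # run in the background
        "slices": slices,                 # parallel execution
        "requests_per_second": requests_per_second,  # throttle
    }
    body = {
        "source": {"index": source_index, "size": batch_size},
        "dest": {"index": target_index, "op_type": "create"},
        "conflicts": "proceed",
    }
    return "_reindex?" + urlencode(params), body


path, body = reindex_request("source", "target", requests_per_second=500)
print(path)
```

Restarting with a lower load then only means calling the helper with a smaller requests_per_second or an explicit number of slices.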
Selective document reindexing
The _reindex operation allows copying not the entire index, but only documents matching a search query.
By condition
POST _reindex
{
"source":{
"index":"source",
"query": {
"match": {
"field_name": "text"
}
}
},
"dest":{
"index":"<index-name>"
}
}
The complete list of available query operations is provided in the official OpenSearch documentation.
Only specific fields
POST _reindex
{
"source": {
"index": "source",
"_source": [
"field_1",
"field_2",
"field_3"
]
},
"dest": {
"index": "<index-name>"
}
}
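The query filter and the field filter can be combined in one request. A sketch with a hypothetical helper, selective_reindex_body; the field names are the placeholders used in the examples above.

```python
def selective_reindex_body(source_index, target_index, query=None, fields=None):
    """Build a _reindex body that copies only matching documents and chosen fields."""
    source = {"index": source_index}
    if query is not None:
        source["query"] = query          # which documents to copy
    if fields:
        source["_source"] = list(fields)  # which fields to keep
    return {"source": source, "dest": {"index": target_index}}


body = selective_reindex_body(
    "source",
    "target",
    query={"match": {"field_name": "text"}},
    fields=["field_1", "field_2"],
)
print(body)
```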
Merging multiple indexes
To merge documents from multiple indexes into one, you need to specify the source indexes as a list.
POST _reindex
{
"source":{
"index":[
"source_1",
"source_2"
]
},
"dest":{
"index":"destination"
}
}
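The OpenSearch documentation recommends keeping the number of shards the same in the source and target indexes. Treat this as a recommendation rather than a strict requirement: changing the number of primary shards is itself one of the reasons for reindexing.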
Document transformation during reindexing
Method 1: Script
For simple transformations, the script section is used. The recommended language is Painless.
Example: renaming field client_id to customer_id
POST _reindex
{
"source": {
"index": "source"
},
"dest": {
"index": "<index-name>"
},
"script": {
"source": "ctx._source.customer_id = ctx._source.remove('client_id')"
}
}
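One edge case is worth checking before the script is run on real data: for a document without client_id, remove returns null, so the script above still creates customer_id with a null value. The Python sketch below mirrors that logic on plain dictionaries to show the effect. It illustrates the semantics only and does not run in the cluster. GUARDED_SCRIPT is a suggested Painless variant that skips such documents.

```python
def rename_unguarded(doc):
    """Mirror of: ctx._source.customer_id = ctx._source.remove('client_id')"""
    doc = dict(doc)
    doc["customer_id"] = doc.pop("client_id", None)  # None stands in for null
    return doc


def rename_guarded(doc):
    """Rename only when the old field is present."""
    doc = dict(doc)
    if "client_id" in doc:
        doc["customer_id"] = doc.pop("client_id")
    return doc


# Painless equivalent of rename_guarded, for the "source" of the script section.
GUARDED_SCRIPT = (
    "if (ctx._source.containsKey('client_id')) "
    "{ ctx._source.customer_id = ctx._source.remove('client_id') }"
)

print(rename_unguarded({"total": 10}))
print(rename_guarded({"total": 10}))
```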
Method 2: Ingest pipeline
For more complex transformations, an ingest pipeline is used.
- First, you need to create a pipeline with the required processors:
PUT _ingest/pipeline/pipeline-test
{
"description": "Converts text field to list. Calculates the length of the 'word' field and saves it in the new field 'word_count'. Deletes the 'test' field",
"processors": [
{
"split": {
"field": "text",
"separator": "\\s+",
"target_field": "word"
}
},
{
"script": {
"lang": "painless",
"source": "ctx.word_count = ctx.word.length"
}
},
{
"remove": {
"field": "test"
}
}
]
}
- Then specify the pipeline in dest:
POST _reindex
{
"source": {
"index": "source"
},
"dest": {
"index": "<index-name>",
"pipeline": "pipeline-test"
}
}
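Before the pipeline is applied to a whole index, it can be tried on a few sample documents through the simulate endpoint, POST _ingest/pipeline/pipeline-test/_simulate. The sketch below only builds the request body; simulate_body and the sample document are illustrative.

```python
import json


def simulate_body(sample_sources):
    """Wrap sample documents in the body expected by the pipeline simulate endpoint."""
    return {"docs": [{"_source": dict(source)} for source in sample_sources]}


# The sample includes a 'test' field because the remove processor fails on
# documents where the field is missing unless ignore_missing is enabled.
body = simulate_body([{"text": "one two three", "test": "to be removed"}])
print(json.dumps(body, indent=2))
```

If some source documents have no test field, consider adding "ignore_missing": true to the remove processor, otherwise those documents are rejected during reindexing.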
Updating documents in the current index
To update data directly in the current index without creating a new one, the _update_by_query operation is used.
Operation features:
- executed with POST method
- can work with only one index at a time
POST <index_name>/_update_by_query
If you run this command without parameters, it will increase the version number for all documents in the specified index.
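In practice the request usually carries a query that selects the documents and a script that changes them. A minimal sketch: update_by_query_request is a hypothetical helper, and the status field with its values is invented for the example.

```python
def update_by_query_request(index_name, query, script_source, params=None):
    """Return (path, body) for a POST <index_name>/_update_by_query call."""
    body = {
        "query": query,
        "script": {
            "lang": "painless",
            "source": script_source,
            "params": params or {},
        },
    }
    return f"{index_name}/_update_by_query", body


path, body = update_by_query_request(
    "orders-v2",
    {"term": {"status": "new"}},           # hypothetical field and value
    "ctx._source.status = params.status",
    {"status": "pending"},
)
print(path, body)
```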
Execution tracking
GET _tasks?actions=*reindex*&detailed=true
GET _tasks/<node_id>:<task_id>
POST _tasks/<node_id>:<task_id>/_cancel
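The response of GET _tasks/<node_id>:<task_id> contains counters from which progress can be estimated. The sketch below assumes the usual shape of a reindex task status (total, created, updated, deleted under task.status); reindex_progress is a hypothetical helper. Version conflicts and no-ops are not counted, so the value may stay below 1.0 even when the task has finished.

```python
def reindex_progress(task_response):
    """Estimate the processed fraction (0.0 to 1.0) of a running reindex task."""
    status = task_response["task"]["status"]
    total = status.get("total", 0)
    if total == 0:
        return 0.0
    done = sum(status.get(key, 0) for key in ("created", "updated", "deleted"))
    return min(done / total, 1.0)


# Sample response fragment with made-up numbers.
sample = {
    "completed": False,
    "task": {"status": {"total": 2000, "created": 400, "updated": 100, "deleted": 0}},
}
print(f"{reindex_progress(sample):.0%}")
```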
Signs that load needs to be reduced:
- search or write request latency grows
- rejected requests appear in thread pool
- JVM heap pressure grows
- disk approaches flood-stage watermark
- cluster stays in red or unstable yellow for a long time
In this case, you need to cancel the task and restart it with a lower requests_per_second or number of slices.
Rollback
If a problem is detected after switching, you need to return the alias to the old index:
POST _aliases
{
"actions": [
{ "remove": { "index": "new-index", "alias": "index-alias" } },
{ "add": { "index": "old-index", "alias": "index-alias" } }
]
}
If new entries have already appeared in the new index after switching, before rollback it is necessary to determine whether they need to be transferred back. Without this, the latest changes may be lost at the application level.
Limitations and risks
- _reindex copies documents but does not automatically transfer index settings, templates and aliases
- changes in the source index during the operation are not blocked
- with active writing, a resynchronization strategy is required
- large reindexing creates load on disk I/O, CPU, heap and thread pools
- with insufficient disk space, the index may become read-only due to watermark
- scripts and pipeline can slow down the operation
- mapping errors in the target index will result in rejected documents
- conflicts=proceed skips conflicts but does not fix their cause
Source index parameters
| Parameter | Valid values | Description | Required |
|---|---|---|---|
| index | String | Source index name. Multiple indexes can be specified as a list. | NO |
| max_docs | Integer | Maximum number of documents to reindex. | NO |
| query | Object | Search query to select documents for reindexing operation. | NO |
| size | Integer | Number of documents read per batch. To limit the total number of documents, use max_docs. | NO |
| slice | String | Sets manual or automatic parallelization (slicing) to speed up the reindexing process. | NO |
Target index parameters
| Parameter | Valid values | Description | Required |
|---|---|---|---|
| index | String | Target index name. | YES |
| version_type | Enum | Version control type for indexing operation. Valid values: internal, external, external_gt, external_gte. | NO |
Checklist before launch
- snapshot created or other recovery method confirmed (if necessary)
- cluster state checked
- target index created with required settings and mappings
- test reindexing performed
- disk space estimated
- strategy for handling records during migration determined
- rollback prepared via alias
Checklist after launch
- _reindex task completed without critical errors
- number of documents verified
- sample documents checked
- production index settings returned
- alias switched with one atomic request
- application checked after switching
- old index left for observation period
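The first two items of this checklist can be checked by a script. A sketch under stated assumptions: verify_reindex is a hypothetical helper, reindex_response is the summary returned by a synchronous _reindex call (with its failures and version_conflicts fields), and the counts are the count values from GET <index>/_count for the source and target indexes. Equal counts are only expected when the whole index was copied without a filter.

```python
def verify_reindex(reindex_response, source_count, target_count):
    """Return a list of findings; an empty list means the basic checks passed."""
    findings = []
    failures = reindex_response.get("failures") or []
    if failures:
        findings.append(f"{len(failures)} failure(s) reported")
    conflicts = reindex_response.get("version_conflicts", 0)
    if conflicts:
        findings.append(f"{conflicts} version conflict(s) skipped")
    if source_count != target_count:
        findings.append(f"document counts differ: {source_count} vs {target_count}")
    return findings


# Made-up numbers for illustration.
print(verify_reindex({"total": 100, "created": 98, "version_conflicts": 2, "failures": []}, 100, 98))
```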