Version: 6.0

Reindex data

Reindexing copies documents from one Search Anywhere Framework Data Storage index to another using the _reindex API operation. It is used when the data storage or processing structure must change, and it avoids manually downloading and re-uploading documents.

Reindexing does not modify the source index and does not automatically transfer its settings, mappings, aliases or index templates. Create the target index in advance so that the required schema and settings are defined explicitly.

Terms

| Term | Description |
| --- | --- |
| Source index | Index from which documents are copied |
| Target index | New index to which documents are written |
| Alias | Logical index name. Allows switching the application to a new index without changing its configuration |
| Ingest pipeline | Set of processors applied to documents when writing |
| Slice | Part of the reindexing task. Used for parallel processing of large indexes |
| Task | Background Search Anywhere Framework Data Storage task created when _reindex is run asynchronously |

When reindexing is required

  • changing field type in mapping
  • changing analyzers, normalizers or the number of primary shards
  • renaming index
  • renaming, deleting or transforming fields
  • merging multiple indexes into one
  • copying part of data by filter
  • applying ingest pipeline to already existing documents

Some changes, for example adding a new field to the mapping, can be performed without reindexing. Before starting the operation, check whether the change really requires copying the data.


Prerequisites

Before starting work, ensure that:

  • access to Dev Tools, curl or another HTTP client is available
  • the user has rights to read the source index and write to the target
  • there is enough free disk space to hold two copies of the data (source and target) at the same time
  • the cluster is in green state, or in a yellow state that is acceptable for maintenance
  • a snapshot or another data recovery method is in place (if necessary)

General zero-downtime scheme

Recommended scenario for production using alias:

  1. Create a new index with a new version name, for example orders-v2
  2. Copy data from the current index via _reindex
  3. Check the number of documents, sample documents, mapping and search queries
  4. Stop or limit writing to the old index for a short time
  5. Resynchronize documents that changed during the main reindexing
  6. Atomically switch alias from old index to new
  7. Check the application
  8. Delete the old index only after observation period
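Step 5 can be implemented as a second, filtered _reindex pass. The sketch below shows one possible approach, not a prescribed one: it assumes that every document carries a last-modified timestamp (the field name updated_at and the index name orders-v1 are hypothetical) and builds a request body that copies only documents changed since the main pass started. Documents deleted from the source during migration are not propagated this way and need separate handling.

```python
import json


def build_resync_body(source_index, dest_index, timestamp_field, since):
    """Build a _reindex body that copies only documents changed since `since`."""
    return {
        "source": {
            "index": source_index,
            # Range query on a hypothetical "last modified" field.
            "query": {"range": {timestamp_field: {"gte": since}}},
        },
        # No op_type is set, so the default ("index") applies: a document with
        # an existing _id replaces its earlier copy in the target index.
        "dest": {"index": dest_index},
    }


body = build_resync_body(
    "orders-v1", "orders-v2", "updated_at", "2024-01-01T00:00:00Z"
)
print("POST _reindex")
print(json.dumps(body, indent=2))
```

The timestamp passed as `since` should be slightly earlier than the start of the main reindexing pass so that no change falls into the gap.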

Alias switching is performed with a single request to ensure that the alias points to exactly one index at any time:

POST _aliases
{
  "actions": [
    { "remove": { "index": "old-index", "alias": "index-alias" } },
    { "add": { "index": "new-index", "alias": "index-alias" } }
  ]
}
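When the switch is scripted, the request body can be generated rather than typed by hand. The following is a minimal sketch (the helper name build_alias_swap is hypothetical) that produces the same "remove" plus "add" pair in one actions list. Calling it with the index arguments reversed yields the rollback request.

```python
import json


def build_alias_swap(alias, from_index, to_index):
    """Return a body for POST _aliases that moves `alias` in a single request."""
    return {
        "actions": [
            {"remove": {"index": from_index, "alias": alias}},
            {"add": {"index": to_index, "alias": alias}},
        ]
    }


switch = build_alias_swap("index-alias", "old-index", "new-index")
rollback = build_alias_swap("index-alias", "new-index", "old-index")

print("POST _aliases")
print(json.dumps(switch, indent=2))
```

Keeping both actions in one body is what makes the switch atomic. Sending "remove" and "add" as two separate requests would leave a moment when the alias points to no index.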

Reindexing all documents

1. Creating the target index

First, you need to create the target index with the required field structure (mapping) and settings. They can be set manually or copied from the source index.

Performance recommendation

When creating temporary indexes, it is recommended to set the number of replicas (number_of_replicas) to 0 for the duration of reindexing. After completion, restore the original value.

Note!

Do not rely on automatic index creation during _reindex: Search Anywhere Framework Data Storage will create the index with dynamic mapping, which may not match the required schema.

PUT <index-name>
{
  "mappings": {
    ... // Specify the required mapping
  },
  "settings": {
    ... // Specify the required settings
  }
}
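The replica recommendation above can be sketched as two index settings updates, one before and one after reindexing. The example assumes the standard index settings endpoint (PUT <index>/_settings). The index name orders-v2 and the original replica count of 1 are illustrative values only.

```python
import json


def replica_settings_request(index_name, replicas):
    """Return (method, path, body) for changing number_of_replicas on an index."""
    if replicas < 0:
        raise ValueError("replicas must be zero or positive")
    return "PUT", f"{index_name}/_settings", {"index": {"number_of_replicas": replicas}}


# Before reindexing: drop replicas so each document is written only once.
before = replica_settings_request("orders-v2", 0)
# After reindexing: restore the value used in production (assumed to be 1 here).
after = replica_settings_request("orders-v2", 1)

for method, path, body in (before, after):
    print(method, path)
    print(json.dumps(body))
```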

2. Performing the Reindex operation

For small indexes, synchronous launch is suitable:

POST _reindex
{
  "source": {
    "index": "source"
  },
  "dest": {
    "index": "<index-name>"
  }
}

For large indexes, use an asynchronous launch so that the operation does not depend on the HTTP connection staying open:

POST _reindex?wait_for_completion=false&slices=auto&requests_per_second=1000
{
  "source": {
    "index": "source",
    "size": 1000
  },
  "dest": {
    "index": "<index-name>",
    "op_type": "create"
  },
  "conflicts": "proceed"
}

Launch parameters:

| Parameter | Purpose |
| --- | --- |
| wait_for_completion=false | Runs the operation in the background and returns task_id |
| slices=auto | Enables automatic parallel execution |
| requests_per_second | Limits speed to reduce cluster load |
| source.size | Batch size for reading documents |
| dest.op_type=create | Does not overwrite existing documents in the target index |
| conflicts=proceed | Continues execution on version conflict or existing _id |
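To show how the launch parameters fit together, the sketch below assembles the asynchronous request from the values in the table. The function name and its defaults are illustrative. It only builds the path and body and does not send anything.

```python
import json
from urllib.parse import urlencode


def build_async_reindex(source_index, dest_index, requests_per_second=1000,
                        batch_size=1000, slices="auto"):
    """Return (path, body) for a background _reindex launch."""
    params = urlencode({
        "wait_for_completion": "false",   # run as a background task
        "slices": slices,                 # parallel execution
        "requests_per_second": requests_per_second,  # throttling
    })
    body = {
        "source": {"index": source_index, "size": batch_size},
        # op_type=create never overwrites a document that already exists.
        "dest": {"index": dest_index, "op_type": "create"},
        # Skip conflicts instead of aborting the whole operation.
        "conflicts": "proceed",
    }
    return f"_reindex?{params}", body


path, body = build_async_reindex("source", "orders-v2")
print("POST", path)
print(json.dumps(body, indent=2))
```

Lowering requests_per_second or batch_size reduces pressure on the cluster at the cost of a longer run.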

Selective document reindexing

The _reindex operation allows copying not the entire index, but only documents matching a search query.

By condition

POST _reindex
{
  "source": {
    "index": "source",
    "query": {
      "match": {
        "field_name": "text"
      }
    }
  },
  "dest": {
    "index": "<index-name>"
  }
}

Note!

The complete list of available operations is provided in the official OpenSearch documentation.

Only specific fields

POST _reindex
{
  "source": {
    "index": "source",
    "_source": [
      "field_1",
      "field_2",
      "field_3"
    ]
  },
  "dest": {
    "index": "<index-name>"
  }
}
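The query filter and the field list can be combined in one request. The following sketch (helper name and sample values are hypothetical) builds such a body and adds the "query" and "_source" keys only when they are supplied.

```python
import json


def build_selective_reindex(source_index, dest_index, query=None, fields=None):
    """Build a _reindex body restricted by an optional query and field list."""
    source = {"index": source_index}
    if query is not None:
        source["query"] = query          # copy only matching documents
    if fields:
        source["_source"] = list(fields)  # copy only the listed fields
    return {"source": source, "dest": {"index": dest_index}}


body = build_selective_reindex(
    "source",
    "orders-v2",
    query={"match": {"field_name": "text"}},
    fields=["field_1", "field_2"],
)
print("POST _reindex")
print(json.dumps(body, indent=2))
```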

Merging multiple indexes

To merge documents from multiple indexes into one, you need to specify the source indexes as a list.

POST _reindex
{
  "source": {
    "index": [
      "source_1",
      "source_2"
    ]
  },
  "dest": {
    "index": "destination"
  }
}

Note!

You need to ensure that the number of shards in the source and target indexes matches. Otherwise, the operation may fail.


Document transformation during reindexing

Method 1: Script

For simple transformations, the script section is used. The recommended language is Painless.

Example: renaming field client_id to customer_id

POST _reindex
{
  "source": {
    "index": "source"
  },
  "dest": {
    "index": "<index-name>"
  },
  "script": {
    "source": "ctx._source.customer_id = ctx._source.remove('client_id')"
  }
}
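If some documents do not contain client_id, the script above would write customer_id with a null value, because removing a missing key returns null. A guarded variant is sketched below as one possible refinement, not as the required form. The parameter names from_field and to_field are arbitrary choices made for this example.

```python
import json

# Painless source: rename only when the old field is present.
RENAME_SCRIPT = (
    "if (ctx._source.containsKey(params.from_field)) { "
    "ctx._source.put(params.to_field, ctx._source.remove(params.from_field)); }"
)


def build_rename_reindex(source_index, dest_index, from_field, to_field):
    """Build a _reindex body whose script renames a field when it exists."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "script": {
            "lang": "painless",
            "source": RENAME_SCRIPT,
            # Field names are passed as params so the script text stays constant.
            "params": {"from_field": from_field, "to_field": to_field},
        },
    }


body = build_rename_reindex("source", "orders-v2", "client_id", "customer_id")
print("POST _reindex")
print(json.dumps(body, indent=2))
```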

Method 2: Ingest pipeline

For more complex transformations, use an ingest pipeline.

  1. First, you need to create a pipeline with the required processors:
PUT _ingest/pipeline/pipeline-test
{
  "description": "Splits the 'text' field on whitespace into the list field 'word'. Stores the number of items in 'word' in the new field 'word_count'. Deletes the 'test' field",
  "processors": [
    {
      "split": {
        "field": "text",
        "separator": "\\s+",
        "target_field": "word"
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "ctx.word_count = ctx.word.size()"
      }
    },
    {
      "remove": {
        "field": "test"
      }
    }
  ]
}
  2. Then specify the pipeline in dest:
POST _reindex
{
  "source": {
    "index": "source"
  },
  "dest": {
    "index": "<index-name>",
    "pipeline": "pipeline-test"
  }
}
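Before running a pipeline over a whole index, it can be checked on a few sample documents. The sketch below assumes that the pipeline simulate endpoint (_ingest/pipeline/<id>/_simulate) is available, as it is in OpenSearch. The sample document and the response shown are hand-written illustrations, not real cluster output.

```python
import json


def build_simulate_request(pipeline_id, sample_sources):
    """Return (path, body) for simulating a pipeline on sample documents."""
    path = f"_ingest/pipeline/{pipeline_id}/_simulate"
    body = {"docs": [{"_source": source} for source in sample_sources]}
    return path, body


def extract_results(simulate_response):
    """Collect the transformed _source (or the error) for each simulated doc."""
    results = []
    for entry in simulate_response.get("docs", []):
        if "doc" in entry:
            results.append(entry["doc"]["_source"])
        else:
            results.append({"error": entry.get("error")})
    return results


path, body = build_simulate_request(
    "pipeline-test", [{"text": "quick brown fox", "test": "x"}]
)
print("POST", path)
print(json.dumps(body))

# Hand-written example of the response shape expected for the document above.
sample_response = {
    "docs": [
        {"doc": {"_source": {"text": "quick brown fox",
                             "word": ["quick", "brown", "fox"],
                             "word_count": 3}}}
    ]
}
results = extract_results(sample_response)
```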

Updating documents in the current index

To update data directly in the current index without creating a new one, the update_by_query operation is used.

Operation features:

  • executed with POST method
  • can work with only one index at a time

Command example

POST <index_name>/_update_by_query

Note!

If you run this command without parameters, it will increase the version number for all documents in the specified index.
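In practice the request usually carries a query that selects documents and a script that changes them. The following is a minimal sketch of such a request. The index name, the status field and its values are hypothetical.

```python
import json


def build_update_by_query(index_name, query, script_source, params=None):
    """Return (path, body) for an _update_by_query request with a script."""
    # conflicts=proceed keeps going if a document changes during the run.
    path = f"{index_name}/_update_by_query?conflicts=proceed"
    script = {"lang": "painless", "source": script_source}
    if params:
        script["params"] = params
    return path, {"query": query, "script": script}


path, body = build_update_by_query(
    "orders-v2",
    {"term": {"status": "closed"}},
    "ctx._source.status = params.new_status",
    {"new_status": "archived"},
)
print("POST", path)
print(json.dumps(body, indent=2))
```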


Execution tracking

List of active reindexing tasks

GET _tasks?actions=*reindex*&detailed=true

Checking a specific task

GET _tasks/<node_id>:<task_id>

Canceling a task

POST _tasks/<node_id>:<task_id>/_cancel
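The response to the task check can be turned into a rough progress figure. The sketch below assumes the usual shape of a reindex task status (total, created, updated, deleted and version_conflicts under task.status). The sample response is hand-written for illustration, and the counters of a sliced task should be verified on a real cluster before relying on them.

```python
def reindex_progress(task_response):
    """Estimate progress of a reindex task from a GET _tasks/<id> response."""
    status = task_response.get("task", {}).get("status", {})
    total = status.get("total", 0)
    # Documents already written to, or removed from, the target index.
    processed = (
        status.get("created", 0)
        + status.get("updated", 0)
        + status.get("deleted", 0)
    )
    return {
        "completed": task_response.get("completed", False),
        "processed": processed,
        "total": total,
        "fraction": processed / total if total else 0.0,
        # Reported separately: these documents were skipped, not written.
        "version_conflicts": status.get("version_conflicts", 0),
    }


# Hand-written sample response, for illustration only.
sample = {
    "completed": False,
    "task": {"status": {"total": 1000, "created": 250, "updated": 0,
                        "deleted": 0, "version_conflicts": 5}},
}
progress = reindex_progress(sample)
print(progress)
```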

Signs that load needs to be reduced:

  • search or write request latency grows
  • thread pools start rejecting requests
  • JVM heap pressure grows
  • disk usage approaches the flood-stage watermark
  • the cluster stays in red state or in unstable yellow state for a long time

In this case, you need to cancel the task and restart it with a lower requests_per_second or number of slices.


Rollback

If a problem is detected after switching, you need to return the alias to the old index:

POST _aliases
{
  "actions": [
    { "remove": { "index": "new-index", "alias": "index-alias" } },
    { "add": { "index": "old-index", "alias": "index-alias" } }
  ]
}

Note!

If new documents have already been written to the new index after switching, determine before rollback whether they need to be transferred back to the old index. Otherwise, the latest changes may be lost at the application level.


Limitations and risks

  • _reindex copies documents but does not automatically transfer index settings, templates and aliases
  • changes in the source index during the operation are not blocked
  • with active writing, a resynchronization strategy is required
  • large reindexing creates load on disk I/O, CPU, heap and thread pools
  • with insufficient disk space, the index may become read-only when the flood-stage watermark is reached
  • scripts and pipelines can slow down the operation
  • mapping errors in the target index will result in rejected documents
  • conflicts=proceed skips conflicts but does not fix their cause

Source index parameters

| Parameter | Valid values | Description | Required |
| --- | --- | --- | --- |
| index | String | Source index name. Multiple indexes can be specified as a list. | Yes |
| max_docs | Integer | Maximum number of documents to reindex. | No |
| query | Object | Search query to select documents for the reindexing operation. | No |
| size | Integer | Number of documents read per batch (batch size). | No |
| slice | String | Sets manual or automatic parallelization (slicing) to speed up the reindexing process. | No |

Target index parameters

| Parameter | Valid values | Description | Required |
| --- | --- | --- | --- |
| index | String | Target index name. | Yes |
| version_type | Enum | Version control type for the indexing operation. Valid values: internal, external, external_gt, external_gte. | No |

Checklist before launch

  • snapshot created or other recovery method confirmed (if necessary)
  • cluster state checked
  • target index created with required settings and mappings
  • test reindexing performed
  • disk space estimated
  • strategy for handling records during migration determined
  • rollback prepared via alias

Checklist after launch

  • _reindex task completed without critical errors
  • number of documents verified
  • sample documents checked
  • production index settings restored (for example, number_of_replicas)
  • alias switched with one atomic request
  • application checked after switching
  • old index left for observation period
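
The first items of this checklist can be partly automated. The sketch below assumes that the final _reindex result (or the completed task response) exposes timed_out, failures and version_conflicts, and that document counts were obtained separately, for example with GET <index>/_count. All sample values are invented for illustration.

```python
def check_reindex_result(result, source_count, target_count):
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    if result.get("timed_out"):
        problems.append("operation timed out")
    failures = result.get("failures") or []
    if failures:
        problems.append(f"{len(failures)} failure(s) reported")
    conflicts = result.get("version_conflicts", 0)
    if conflicts:
        problems.append(f"{conflicts} version conflict(s) skipped")
    if source_count != target_count:
        problems.append(
            f"document count mismatch: source={source_count}, target={target_count}"
        )
    return problems


# Invented sample values, for illustration only.
ok = check_reindex_result({"timed_out": False, "failures": []}, 1000, 1000)
bad = check_reindex_result(
    {"timed_out": False, "failures": [], "version_conflicts": 3}, 1000, 997
)
print(ok)
print(bad)
```

A count mismatch is expected when only part of the data is copied by a filter. In that case compare the target count with the number of documents matching the same query in the source.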