Data Storage Parameter Configuration
Basic Definitions
A shard is a subset of data in a database that is stored on a single server. In the context of a cluster, shards are distributed across different servers for parallel query processing. Data sharding is used when the total amount of data is too large for a single server or when it is necessary to scale the system to improve performance. Data is distributed across shards based on a specific key, such as user_id or timestamp.
A replica is a copy of data from one or more shards that is stored on a different server. Replication increases the fault tolerance of the system because the data is duplicated, and if one server fails, another can continue to operate. Replication is necessary to ensure high data availability and protect against data loss in case of hardware failure. The more replicas, the higher the availability, but this also requires more resources.
Table Parameter Settings
When using the ISM action Move to ClickHouse, tables are created automatically with basic settings. However, when creating tables manually, you need to define the data schema based on the index mapping in OpenSearch.
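For example, a minimal sketch of a manually defined schema (the fields name, age, and timestamp here are hypothetical; common type correspondences are OpenSearch keyword → String, long → Int64, date → DateTime):
CREATE TABLE my_index_local
(
    name String,        -- from an OpenSearch keyword field
    age Int64,          -- from an OpenSearch long field
    timestamp DateTime  -- from an OpenSearch date field
)
ENGINE = MergeTree()
ORDER BY timestamp;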
Shard Distribution
When manually creating tables in a ClickHouse cluster, you need to configure how data is distributed across shards. This lets you balance the load across the cluster and ensures scalability. The Distributed table engine (ENGINE = Distributed) is used for this; it routes data to shards based on a sharding key.
Replication
To configure replicated tables, use the ReplicatedMergeTree engine, which supports automatic data synchronization between replicas of the same shard.
Storage Time Control
To manage the storage time of data in ClickHouse tables, you can use the TTL (Time-To-Live) option. It allows you to automatically delete or move old data after a certain period.
Creating a Table with Sharding Support
To make a table use multiple shards, create it with the Distributed engine. Here's an example of creating a table distributed across shards:
CREATE TABLE my_table_sharded ON CLUSTER cluster_1S_2R (
id UInt32,
name String,
age UInt8
) ENGINE = Distributed(cluster_1S_2R, default, my_table, rand());
Description of Distributed Engine Parameters
cluster_1S_2R - The name of the cluster where the table is created. It must match the cluster name specified in the configuration file.
default - The name of the database where the local table is stored. Here, default is the standard ClickHouse database, but you can use any other database name defined in the configuration.
my_table - The name of the local table stored on each cluster node. This table must be created beforehand on all cluster nodes (a sketch of this is shown below); the distributed table (my_table_sharded) uses it to store data on each server.
rand() - The function that determines how data is distributed among the shards. Here, rand() distributes data among the shards randomly.
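As a sketch of the full sequence (assuming the same cluster_1S_2R cluster as above), the local table can be created on every node with ON CLUSTER, and a deterministic sharding key such as intHash64(id) can replace rand() so that rows with the same id always land on the same shard:
-- Local table, created on each node of the cluster
CREATE TABLE my_table ON CLUSTER cluster_1S_2R (
    id UInt32,
    name String,
    age UInt8
) ENGINE = MergeTree()
ORDER BY id;

-- Distributed table that shards by a hash of id instead of randomly
CREATE TABLE my_table_sharded_by_id ON CLUSTER cluster_1S_2R (
    id UInt32,
    name String,
    age UInt8
) ENGINE = Distributed(cluster_1S_2R, default, my_table, intHash64(id));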
Configuring Replicated Tables
Replicated tables in ClickHouse allow you to store copies of data on multiple nodes, ensuring availability and fault tolerance. To create a replicated table, use the ReplicatedMergeTree engine (ENGINE = ReplicatedMergeTree) and specify the path to the table's metadata in ClickHouse Keeper.
Example of a replicated table configuration:
CREATE TABLE example_table
(
id UInt32,
name String,
value Float32
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/example_table', '{replica}')
ORDER BY id;
Description of ReplicatedMergeTree Parameters
/clickhouse/tables/{shard}/example_table - The path in ClickHouse Keeper where the metadata of the replicated table is stored.
{shard} - A variable that is automatically replaced with the identifier of the shard on which the table is created.
example_table - The name of the table. It must be the same on all replicas so that the system can correctly synchronize the data.
{replica} - A variable that is automatically replaced with the name or identifier of the specific node (replica) in the cluster.
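A sketch that combines these parameters with ON CLUSTER, assuming the {shard} and {replica} macros are defined in each node's configuration (they usually are in standard cluster setups):
CREATE TABLE example_table ON CLUSTER cluster_1S_2R
(
    id UInt32,
    name String,
    value Float32
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/example_table', '{replica}')
ORDER BY id;

-- Data inserted on one replica is synchronized to the other replicas of the same shard
INSERT INTO example_table VALUES (1, 'first', 0.5);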
Setting TTL (Time to Live) for Data
Time to Live (TTL) in ClickHouse is a mechanism for automatically deleting or moving data after a specified time. TTL is set at the table or individual column level, specifying how long the data should be stored.
Creating a Table with TTL Parameters
Example of creating a table with TTL settings:
CREATE TABLE example_table
(
id UInt32,
name String,
timestamp DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY id
TTL timestamp + INTERVAL 1 MONTH
SETTINGS storage_policy = 'default';
TTL timestamp + INTERVAL 1 MONTH - This expression defines the rule for automatic data deletion: all rows whose timestamp column value is older than one month from the current date are deleted automatically.
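TTL can also move data instead of deleting it, and can be set on an individual column. A sketch, assuming a storage policy named tiered with a volume named cold is configured (both names are hypothetical):
CREATE TABLE example_table_tiered
(
    id UInt32,
    name String TTL timestamp + INTERVAL 1 WEEK,  -- column TTL: name is reset to its default after a week
    timestamp DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY id
TTL timestamp + INTERVAL 1 WEEK TO VOLUME 'cold',  -- move old parts to slower storage
    timestamp + INTERVAL 1 MONTH DELETE            -- delete them entirely after a month
SETTINGS storage_policy = 'tiered';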
Example of Setting TTL for an Existing Table
Applying TTL parameters to an existing table:
ALTER TABLE example_table
MODIFY TTL timestamp + INTERVAL 1 MONTH;
After applying TTL, you can check that the settings have been applied correctly:
SHOW CREATE TABLE example_table;
In the output of the command, you should see the TTL section with your settings.
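Note that TTL rules are normally applied in the background during merges, so expired rows may not disappear immediately. To force re-evaluation of TTL for existing parts, you can run:
ALTER TABLE example_table MATERIALIZE TTL;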