Table Engines
Table engines (i.e., table types) determine:
- How and where data is stored, and where it is written to and read from
- What queries are supported and how they are supported
- Concurrent data access
- Use of indexes (if present)
- Whether multi-threaded requests can be executed
- Data replication parameters
The cloud data warehouse UClickHouse supports table engines grouped into four major categories: MergeTree, Log, Integrations, and Special.
See the table below:
| Series | Description | Table Engine | Features |
| --- | --- | --- | --- |
| MergeTree | MergeTree series engines are suitable for high-load tasks, supporting fast writes of large volumes of data and subsequent data processing. They are highly versatile and powerful. Engines in this series share support for data replication, partitioning, and data sampling. | MergeTree | Used for inserting massive amounts of data into a table. Data is written quickly as successive data parts, and these parts are merged in the background according to certain rules. |
| | | ReplacingMergeTree | Solves the issue of duplicate primary keys in the MergeTree engine by deleting rows with the same primary key value. |
| | | CollapsingMergeTree | Adds a marker column `Sign` in the table creation statement to address the limitations of ReplacingMergeTree. |
| | | VersionedCollapsingMergeTree | Adds a `Version` column in the table creation statement to address out-of-order writes that prevent normal collapsing (deletion). |
| | | SummingMergeTree | Pre-aggregates by primary key, merging all rows with the same primary key into one, significantly reducing storage space and improving aggregation performance. |
| | | AggregatingMergeTree | A pre-aggregating engine meant to enhance aggregation performance, allowing various aggregation functions to be specified. |
| | | GraphiteMergeTree | Stores and summarizes Graphite data, reducing storage space and improving the query efficiency of Graphite data. |
| Log | Log series engines are suitable for scenarios where data in small tables (around 1 million rows) is written quickly and read in full later. | TinyLog | Does not support concurrent reads of data files; has a simple format and poor query performance; suitable for temporarily storing intermediate data. |
| | | StripeLog | Supports concurrent reads of data files; stores all columns in one large file, reducing the number of files; offers better query performance than TinyLog. |
| | | Log | Supports concurrent reads of data files; each column is stored in a separate file; offers better query performance than TinyLog. |
| Integrations | Integration series engines are suitable for importing external data into the UClickHouse cloud data warehouse, or for using external data sources directly within UClickHouse. | Kafka | Imports data from Kafka topics directly into the UClickHouse cloud data warehouse. |
| | | MySQL | Uses MySQL as the storage engine, allowing SELECT and other operations on MySQL tables directly within UClickHouse. |
| | | JDBC | Reads data sources through a specified JDBC connection string. |
| | | ODBC | Reads data sources through a specified ODBC connection string. |
| | | HDFS | Reads data files in specific formats directly from HDFS. |
| Special | Special series engines are suitable for specific functional scenarios. | Distributed | Does not store data itself, but supports distributed querying across multiple servers. |
| | | MaterializedView | Used for creating materialized views. |
| | | Dictionary | Exposes dictionary data as a UClickHouse table. |
| | | Merge | Does not store data itself, but can read data from any number of other tables simultaneously. |
| | | File | Uses local files directly as data storage. |
| | | Null | Written data is discarded; reads return an empty result. |
| | | Set | Data is always stored in RAM. |
| | | Join | Data is always stored in RAM. |
| | | URL | Used for managing data on remote HTTP/HTTPS servers. |
| | | View | Does not store data itself; only stores the specified SELECT query. |
| | | Memory | Data is stored in memory and lost on restart. Offers excellent query performance; suitable for small tables (fewer than about 100 million rows) that do not require persistence. Commonly used for temporary tables in UClickHouse. |
| | | Buffer | Keeps an in-memory buffer for a target table and flushes it to disk once configured conditions are met. |
For more detailed information on table engines, please refer to the official documentation: Table Engine Documentation.
MergeTree
The MergeTree table engine is primarily used for massive data analysis, supporting data partitioning, ordered storage, primary key indexing, sparse indexing, and data TTL, among other features. The MergeTree table engine supports all SQL syntax in the UClickHouse cloud data warehouse; however, some features may differ from standard SQL. For example, in the MergeTree table engine the primary key mainly serves to accelerate queries rather than to enforce uniqueness: even after Compaction, rows with the same primary key continue to coexist.
This type of engine includes:
- MergeTree
- ReplacingMergeTree
- SummingMergeTree
- AggregatingMergeTree
- CollapsingMergeTree
- VersionedCollapsingMergeTree
- GraphiteMergeTree
For more detailed information on the table engine, refer to the official documentation: MergeTree.
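The write path just described (each INSERT quickly produces a sorted data part; background merges later combine parts) can be sketched in Python. The part and merge model below is purely illustrative, not UClickHouse internals; the row values are hypothetical.

```python
import heapq

# Each INSERT produces an immutable "part": rows sorted by the ORDER BY key.
def make_part(rows, key):
    return sorted(rows, key=key)

# Background Compaction merges several sorted parts into one larger sorted part.
def merge_parts(parts, key):
    return list(heapq.merge(*parts, key=key))

# Two inserts -> two parts, ordered by (LO_ORDERDATE, LO_ORDERKEY).
key = lambda r: (r["LO_ORDERDATE"], r["LO_ORDERKEY"])
p1 = make_part([{"LO_ORDERDATE": "1996-01-02", "LO_ORDERKEY": 2},
                {"LO_ORDERDATE": "1993-10-14", "LO_ORDERKEY": 3}], key)
p2 = make_part([{"LO_ORDERDATE": "1996-01-02", "LO_ORDERKEY": 1}], key)

merged = merge_parts([p1, p2], key)
print([r["LO_ORDERKEY"] for r in merged])  # [3, 1, 2]: rows come out in key order
```

Note that the merge only reorders rows; it never deduplicates them, which is why duplicate primary keys survive Compaction in plain MergeTree.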
Example:

- Create a table `lineorder_local`, specify the MergeTree engine, partition by the year of `LO_ORDERDATE`, and sort by `(LO_ORDERDATE, LO_ORDERKEY)`.

  ```sql
  CREATE TABLE lineorder_local ON CLUSTER ck_cluster
  (
      LO_ORDERKEY UInt32,
      LO_LINENUMBER UInt8,
      LO_CUSTKEY UInt32,
      LO_PARTKEY UInt32,
      LO_SUPPKEY UInt32,
      LO_ORDERDATE Date,
      LO_ORDERPRIORITY LowCardinality(String),
      LO_SHIPPRIORITY UInt8,
      LO_QUANTITY UInt8,
      LO_EXTENDEDPRICE UInt32,
      LO_ORDTOTALPRICE UInt32,
      LO_DISCOUNT UInt8,
      LO_REVENUE UInt32,
      LO_SUPPLYCOST UInt32,
      LO_TAX UInt8,
      LO_COMMITDATE Date,
      LO_SHIPMODE LowCardinality(String)
  )
  ENGINE = MergeTree()
  PRIMARY KEY (LO_ORDERDATE, LO_ORDERKEY)
  PARTITION BY toYear(LO_ORDERDATE)
  ORDER BY (LO_ORDERDATE, LO_ORDERKEY);
  ```
- Write data with duplicate primary keys.

  Refer to Quick Start -> Prepare Data and Import.

  Prepare the data and save it as a lineorder.csv file:

  ```csv
  1,1,7381,155190,828,"1996-01-02","5-LOW",0,17,2116823,17366547,4,2032150,74711,2,"1996-02-12","TRUCK",
  1,2,7381,67310,163,"1996-01-02","5-LOW",0,36,4598316,17366547,9,4184467,76638,6,"1996-02-28","MAIL",
  1,3,7381,63700,71,"1996-01-02","5-LOW",0,8,1330960,17366547,10,1197864,99822,2,"1996-03-05","REG AIR",
  1,4,7381,2132,943,"1996-01-02","5-LOW",0,28,2895564,17366547,9,2634963,62047,6,"1996-03-30","AIR",
  1,5,7381,24027,1625,"1996-01-02","5-LOW",0,24,2282448,17366547,10,2054203,57061,4,"1996-03-14","FOB",
  1,6,7381,15635,1368,"1996-01-02","5-LOW",0,32,4962016,17366547,7,4614674,93037,2,"1996-02-07","MAIL",
  2,1,15601,106170,1066,"1996-12-01","1-URGENT",0,38,4469446,4692918,0,4469446,70570,5,"1997-01-14","RAIL",
  3,1,24664,4297,1959,"1993-10-14","5-LOW",0,45,5405805,19384625,6,5081456,72077,0,"1994-01-04","AIR",
  3,2,24664,19036,1667,"1993-10-14","5-LOW",0,49,4679647,19384625,10,4211682,57301,0,"1993-12-20","RAIL",
  3,3,24664,128449,1409,"1993-10-14","5-LOW",0,27,3989088,19384625,6,3749742,88646,7,"1993-11-22","SHIP",
  ```

  Import the data:

  ```shell
  clickhouse-client --host=<node IP> --port=9000 --user=admin --password=<password set at cluster creation> --database=<database name> --query="INSERT INTO lineorder_local FORMAT CSV" < /data/lineorder.csv
  ```
- Query the data.

  ```sql
  select * from lineorder_local;
  ```

  The query results are as follows:
- Since the MergeTree series table engines use a structure similar to an LSM tree, much of the storage-layer processing happens only during Compaction, so you need to execute the optimize statement to force a background Compaction.

  ```sql
  optimize table lineorder_local final;
  ```
- Query the data again.

  ```sql
  select * from lineorder_local;
  ```

  The query results are as follows; the rows with duplicate primary keys still exist.
ReplacingMergeTree
To solve the issue of duplicate primary keys in the MergeTree table engine, UClickHouse provides the ReplacingMergeTree table engine, which deletes duplicate entries with the same primary key value.
Although the ReplacingMergeTree table engine can deduplicate by primary key, it still has many limitations. It is therefore mostly used to ensure that data is eventually deduplicated; it cannot guarantee that primary keys will be unique at query time. The main limitations are as follows.
- In distributed scenarios, data with the same primary key may be distributed to different nodes and deduplication may not be possible across different shards.
- Before a complete optimize operation, primary key deduplication might not be effective, for example, some data may be deduplicated but another part may still have duplicate primary keys.
- Optimize is a background task, its exact execution time cannot be predicted.
- Manually executing optimize in massive data scenarios requires a significant amount of time, failing to meet instant business query needs.
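The eventual-deduplication behavior described above can be sketched in Python. The part and merge model is illustrative only (not engine internals), and the rows are hypothetical: duplicates written to separate parts survive until those parts are merged, at which point the later row replaces the earlier one.

```python
# Sketch of ReplacingMergeTree semantics: deduplication happens only when parts
# are merged, so duplicate keys can still be visible to queries before Compaction.
def replacing_merge(parts, key):
    latest = {}
    for part in parts:                  # parts are processed in write order
        for row in part:
            latest[key(row)] = row      # a later row replaces an earlier one with the same key
    return list(latest.values())

key = lambda r: (r["LO_ORDERDATE"], r["LO_ORDERKEY"])
part_a = [{"LO_ORDERDATE": "1996-01-02", "LO_ORDERKEY": 1, "LO_QUANTITY": 17}]
part_b = [{"LO_ORDERDATE": "1996-01-02", "LO_ORDERKEY": 1, "LO_QUANTITY": 36}]

# Before a merge, a query over both parts still sees two rows with the same key.
print(len(part_a) + len(part_b))                     # 2
# After the merge, one row per primary key remains.
print(len(replacing_merge([part_a, part_b], key)))   # 1
```

This is also why, in distributed deployments, rows with the same key on different shards are never deduplicated: they never participate in the same merge.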
For more information on the ReplacingMergeTree table engine, please refer to the official documentation: ReplacingMergeTree.
Example:

- Create table `lineorder_local_replacing`.

  ```sql
  CREATE TABLE lineorder_local_replacing ON CLUSTER ck_cluster
  (
      LO_ORDERKEY UInt32,
      LO_LINENUMBER UInt8,
      LO_CUSTKEY UInt32,
      LO_PARTKEY UInt32,
      LO_SUPPKEY UInt32,
      LO_ORDERDATE Date,
      LO_ORDERPRIORITY LowCardinality(String),
      LO_SHIPPRIORITY UInt8,
      LO_QUANTITY UInt8,
      LO_EXTENDEDPRICE UInt32,
      LO_ORDTOTALPRICE UInt32,
      LO_DISCOUNT UInt8,
      LO_REVENUE UInt32,
      LO_SUPPLYCOST UInt32,
      LO_TAX UInt8,
      LO_COMMITDATE Date,
      LO_SHIPMODE LowCardinality(String)
  )
  ENGINE = ReplacingMergeTree()
  PRIMARY KEY (LO_ORDERDATE, LO_ORDERKEY)
  PARTITION BY toYear(LO_ORDERDATE)
  ORDER BY (LO_ORDERDATE, LO_ORDERKEY);
  ```
- Write data with duplicate primary keys.

  Refer to Quick Start -> Prepare Data and Import.

  Prepare the data and save it as a lineorder.csv file:

  ```csv
  1,1,7381,155190,828,"1996-01-02","5-LOW",0,17,2116823,17366547,4,2032150,74711,2,"1996-02-12","TRUCK",
  1,2,7381,67310,163,"1996-01-02","5-LOW",0,36,4598316,17366547,9,4184467,76638,6,"1996-02-28","MAIL",
  1,3,7381,63700,71,"1996-01-02","5-LOW",0,8,1330960,17366547,10,1197864,99822,2,"1996-03-05","REG AIR",
  1,4,7381,2132,943,"1996-01-02","5-LOW",0,28,2895564,17366547,9,2634963,62047,6,"1996-03-30","AIR",
  1,5,7381,24027,1625,"1996-01-02","5-LOW",0,24,2282448,17366547,10,2054203,57061,4,"1996-03-14","FOB",
  1,6,7381,15635,1368,"1996-01-02","5-LOW",0,32,4962016,17366547,7,4614674,93037,2,"1996-02-07","MAIL",
  2,1,15601,106170,1066,"1996-12-01","1-URGENT",0,38,4469446,4692918,0,4469446,70570,5,"1997-01-14","RAIL",
  3,1,24664,4297,1959,"1993-10-14","5-LOW",0,45,5405805,19384625,6,5081456,72077,0,"1994-01-04","AIR",
  3,2,24664,19036,1667,"1993-10-14","5-LOW",0,49,4679647,19384625,10,4211682,57301,0,"1993-12-20","RAIL",
  3,3,24664,128449,1409,"1993-10-14","5-LOW",0,27,3989088,19384625,6,3749742,88646,7,"1993-11-22","SHIP",
  ```

  Import the data:

  ```shell
  clickhouse-client --host=<node IP> --port=9000 --user=admin --password=<password set at cluster creation> --database=<database name> --query="INSERT INTO lineorder_local_replacing FORMAT CSV" < /data/lineorder.csv
  ```
- Query the data.

  ```sql
  select * from lineorder_local_replacing;
  ```

  The query results are as follows; the duplicate data has been removed:
CollapsingMergeTree
The CollapsingMergeTree table engine addresses the functional limitations of the ReplacingMergeTree table engine. It requires a marker column `Sign` in the table creation statement, which divides rows into two types: rows with `Sign=1` are state rows, used to add a state, while rows with `Sign=-1` are cancel rows, used to delete a state.

Note: Although the CollapsingMergeTree table engine solves the problem of deleting data with the same primary key, when states change constantly and writes happen in multi-threaded parallel mode, state rows and cancel rows may end up out of order, causing the collapse (deletion) to fail.

During background Compaction, rows with the same primary key but opposite `Sign` values are collapsed (deleted); in data that has not yet undergone Compaction, state rows and cancel rows still coexist. Therefore, to actually collapse (delete) rows by primary key, the business layer needs to perform the following operations.
- Record the original value of the state row, or query the database for the original state row before executing a delete-state operation.

  Specific reason: to execute a delete-state operation, a cancel row must be inserted, and the cancel row must contain data identical to the original state row (except for the `Sign` column).
- In aggregate calculations such as `count()` and `sum(col)`, duplicated data may appear. To get correct results, the business layer needs to rewrite the SQL, changing `count()` to `sum(Sign)` and `sum(col)` to `sum(col * Sign)`.

  The specific reasons are as follows.

  - The Compaction timing is unpredictable; at query time, state rows and cancel rows may not yet have been collapsed (deleted).
  - UClickHouse cannot ensure that rows with the same primary key reside on the same node, and data on different nodes cannot be collapsed (deleted).
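A minimal Python sketch shows why the rewritten aggregates stay correct whether or not Compaction has run: an uncollapsed state/cancel pair contributes 0 to both `sum(Sign)` and `sum(col * Sign)`. The row model is illustrative, not engine internals.

```python
rows = [
    # (LO_ORDERKEY, LO_LINENUMBER, Sign)
    (1, 5,  1),   # state row
    (1, 5, -1),   # cancel row for the same primary key, not yet collapsed
    (2, 6,  1),   # live state row
]

def collapsed_sum(rows, orderkey):
    group = [r for r in rows if r[0] == orderkey]
    signs = sum(sign for _, _, sign in group)          # like sum(Sign): net live row count
    total = sum(col * sign for _, col, sign in group)  # like sum(col * Sign)
    return signs, total

print(collapsed_sum(rows, 1))  # (0, 0): the state/cancel pair cancels out
print(collapsed_sum(rows, 2))  # (1, 6): only the live row contributes
```

A plain `count()` or `sum(col)` over the same rows would instead count the uncollapsed pair, which is exactly the duplication the SQL rewrite avoids.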
For more information on the CollapsingMergeTree table engine, please refer to the official documentation: CollapsingMergeTree.
Example:

- Create table `lineorder_local_collapsing`.

  ```sql
  CREATE TABLE lineorder_local_collapsing
  (
      LO_ORDERKEY UInt64,
      LO_LINENUMBER UInt8,
      LO_CUSTKEY UInt8,
      Sign Int8
  )
  ENGINE = CollapsingMergeTree(Sign)
  ORDER BY LO_ORDERKEY;
  ```
- Insert a state row (`Sign=1`).

  ```sql
  INSERT INTO lineorder_local_collapsing VALUES (1, 5, 146, 1);
  ```

  Note: If you insert a cancel row first and then a state row, the rows may end up out of order, making it impossible to collapse (delete) the primary key even after forcing a background Compaction.
- Insert a cancel row (`Sign=-1`) whose values are identical to the state row just inserted (except for `Sign`). At the same time, insert a state row for a new primary key.

  ```sql
  INSERT INTO lineorder_local_collapsing VALUES (1, 5, 146, -1), (2, 6, 185, 1);
  ```
- Query the data.

  ```sql
  SELECT * FROM lineorder_local_collapsing;
  ```

  The query result is as follows:
- If you need to aggregate on a specific column, take `sum(col)` as an example. To get correct results, the SQL statement needs to be rewritten as follows.

  ```sql
  SELECT
      LO_ORDERKEY,
      sum(LO_LINENUMBER * Sign) AS LO_LINENUMBERS,
      sum(LO_CUSTKEY * Sign) AS LO_CUSTKEYS
  FROM lineorder_local_collapsing
  GROUP BY LO_ORDERKEY
  HAVING sum(Sign) > 0;
  ```

  After execution, the result is as follows:
- Since the MergeTree series table engines use a structure similar to an LSM tree, much of the storage-layer processing happens only during Compaction, so you need to execute the optimize statement to force a background Compaction.

  ```sql
  optimize table lineorder_local_collapsing final;
  ```
- Query the data again.

  ```sql
  SELECT * FROM lineorder_local_collapsing;
  ```

  The query result is as follows:
VersionedCollapsingMergeTree
To solve the problem of disordered writes in CollapsingMergeTree table engines causing unsuccessful collapsing (deletion), the ClickHouse cloud database provides the VersionedCollapsingMergeTree table engine, which adds a column Version
in the table creation statement to record the relationship between state rows and cancel rows in disordered scenarios. During background Compaction, rows with the same primary key, same Version
, and opposite Sign
are collapsed (deleted). Similar to the CollapsingMergeTree table engine, when performing aggregate calculations like count()
and sum(col)
, the business layer needs to rewrite the SQL, changing count()
and sum(col)
to sum(Sign)
and sum(col * Sign)
respectively.
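The version-aware matching described above can be sketched in Python: a state row and a cancel row with the same primary key and the same `Version` cancel each other even if the cancel row was written first. This is an illustrative model, not engine internals.

```python
from collections import Counter

# Sketch of VersionedCollapsingMergeTree collapsing: pair up rows by
# (primary key, Version) regardless of the order in which they arrived.
def versioned_collapse(rows):
    net = Counter()
    for key, version, sign in rows:     # rows as (key, Version, Sign)
        net[(key, version)] += sign     # net count of live state rows per (key, Version)
    # keep only (key, Version) groups with a positive net sign
    return [kv for kv, n in net.items() if n > 0]

rows = [
    (1, 1, -1),  # cancel row written first (out of order)
    (1, 1,  1),  # matching state row: same primary key, same Version
    (2, 2,  1),  # unrelated live state row
]
print(versioned_collapse(rows))  # [(2, 2)]: the out-of-order pair still collapses
```

Without the `Version` column, plain CollapsingMergeTree relies on row order to pair state and cancel rows, which is what fails in the out-of-order case.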
For more information on the VersionedCollapsingMergeTree table engine, please refer to the official documentation: VersionedCollapsingMergeTree.
Example:

- Create table `lineorder_local_versioned`.

  ```sql
  CREATE TABLE lineorder_local_versioned
  (
      LO_ORDERKEY UInt64,
      LO_LINENUMBER UInt8,
      LO_CUSTKEY UInt8,
      Sign Int8,
      Version UInt8
  )
  ENGINE = VersionedCollapsingMergeTree(Sign, Version)
  ORDER BY LO_ORDERKEY;
  ```
- Insert a cancel row (`Sign=-1`).

  ```sql
  INSERT INTO lineorder_local_versioned VALUES (1, 3, 222, -1, 1);
  ```
- Insert a state row (`Sign=1`, `Version=1`) with the other column values identical to the cancel row just inserted. At the same time, insert a state row for a new primary key.

  ```sql
  INSERT INTO lineorder_local_versioned VALUES (1, 3, 222, 1, 1), (2, 6, 345, 1, 2);
  ```
- Query the data.

  ```sql
  SELECT * FROM lineorder_local_versioned;
  ```

  The query result is as follows:
- If you need to aggregate on a specific column, take `sum(col)` as an example. To get correct results, the SQL statement needs to be rewritten as follows.

  ```sql
  SELECT
      LO_ORDERKEY,
      sum(LO_LINENUMBER * Sign) AS LO_LINENUMBERS,
      sum(LO_CUSTKEY * Sign) AS LO_CUSTKEYS
  FROM lineorder_local_versioned
  GROUP BY LO_ORDERKEY
  HAVING sum(Sign) > 0;
  ```

  After execution, the query result is as follows:
- Since the MergeTree series table engines use a structure similar to an LSM tree, much of the storage-layer processing happens only during Compaction, so you need to execute the optimize statement to force a background Compaction.

  ```sql
  optimize table lineorder_local_versioned final;
  ```
- Query the data again.

  ```sql
  SELECT * FROM lineorder_local_versioned;
  ```

  The query result is as follows:
SummingMergeTree
The SummingMergeTree table engine is used for pre-aggregating primary key columns by merging all rows with the same primary key into a single row, significantly reducing storage space usage and improving aggregation computation performance.
When using the SummingMergeTree table engine, please note the following points:
- UClickHouse pre-aggregates by primary key only during background Compaction, and the timing of Compaction cannot be predicted, so some data may already be pre-aggregated while other data is not. Therefore, `GROUP BY` clauses still need to be used when performing aggregation calculations.
- During pre-aggregation, UClickHouse pre-aggregates all columns other than the primary key columns. If a column is summable (e.g., a numeric type), `sum` is applied directly; if it is not (e.g., a String type), a random value is chosen.
- It is usually recommended to use the SummingMergeTree table engine together with the MergeTree table engine: the MergeTree table stores the complete data, while the SummingMergeTree table stores the pre-aggregated results.
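The pre-aggregation step can be sketched in Python: during a merge, rows sharing a primary key collapse into a single row whose numeric non-key columns are summed. This is an illustrative model of the merge-time behavior, not engine internals, and it also shows why `GROUP BY` is still needed before Compaction.

```python
# Sketch of SummingMergeTree pre-aggregation during a merge.
def summing_merge(rows):
    sums = {}
    for key, value in rows:
        sums[key] = sums.get(key, 0) + value  # numeric non-key columns are summed per key
    return sorted(sums.items())

# Rows as written (same shape as the (key, value) table used in the example below).
rows = [(1, 1), (1, 2), (2, 1)]

# Before Compaction a query may still see all three raw rows, so GROUP BY is required.
print(len(rows))           # 3
# After the merge, one pre-aggregated row per key remains.
print(summing_merge(rows))  # [(1, 3), (2, 1)]
```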
For more information on the SummingMergeTree table engine, please refer to the official documentation: SummingMergeTree.
Example:

- Create table `lineorder_local_summing`.

  ```sql
  CREATE TABLE lineorder_local_summing
  (
      key UInt32,
      value UInt32
  )
  ENGINE = SummingMergeTree()
  ORDER BY key;
  ```
- Write data.

  ```sql
  INSERT INTO lineorder_local_summing VALUES (1,1),(1,2),(2,1);
  ```
- Query the data.

  ```sql
  select * from lineorder_local_summing;
  ```

  The query result is as follows:
AggregatingMergeTree
The AggregatingMergeTree table engine is also a type of pre-aggregating engine used to enhance the performance of aggregation calculations. The difference from SummingMergeTree is that SummingMergeTree is used to sum
aggregate non-primary key columns, whereas AggregatingMergeTree allows specifying various aggregation functions. The syntax for AggregatingMergeTree is more complex and needs to be used in conjunction with materialized views or the special data type AggregateFunction
in the ClickHouse cloud database.
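The idea behind the `-State`/`-Merge` function pairs used with AggregatingMergeTree can be sketched in Python: instead of raw rows, the table stores a mergeable partial state per group, and states are combined at query time. Here the partial state for `uniq` is modeled as a plain set; this is an illustrative analogy, not the engine's actual state format.

```python
# Partial-state model for uniq(): a state is a set of values seen so far.
def uniq_state(values):
    return set(values)              # analogous to uniqState(): a mergeable partial state

def uniq_merge(states):
    combined = set()
    for state in states:
        combined |= state           # analogous to uniqMerge(): combine partial states
    return len(combined)            # final aggregate value

# Two inserts produce two partial states for the same group.
s1 = uniq_state([1, 2, 3])
s2 = uniq_state([3, 4])
print(uniq_merge([s1, s2]))  # 4 distinct values across both states
```

Because states merge associatively, background Compaction can combine them in any order without changing the final result, which is what makes pre-aggregation safe.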
For more information on the AggregatingMergeTree table engine, please see the official documentation: AggregatingMergeTree.
Example:

- In Conjunction with Materialized Views

- Create table `lineorder`.

  ```sql
  CREATE TABLE lineorder
  (
      LO_ORDERKEY UInt64,
      LO_LINENUMBER UInt8,
      LO_ORDERDATE Date,
      SIGN Int8
  )
  ENGINE = CollapsingMergeTree(SIGN)
  ORDER BY LO_ORDERKEY;
  ```
- Create a materialized view `lineorder_view` for `lineorder`, and use the `sumState` and `uniqState` functions to pre-aggregate the detail table.

  ```sql
  CREATE MATERIALIZED VIEW lineorder_view
  ENGINE = AggregatingMergeTree()
  PARTITION BY toYYYYMM(LO_ORDERDATE)
  ORDER BY (LO_LINENUMBER, LO_ORDERDATE)
  AS SELECT
      LO_LINENUMBER,
      LO_ORDERDATE,
      sumState(SIGN) AS SIGNS,
      uniqState(LO_ORDERKEY) AS ORDERKEYS
  FROM lineorder
  GROUP BY LO_LINENUMBER, LO_ORDERDATE;
  ```
- Write data to `lineorder`.

  ```sql
  INSERT INTO lineorder VALUES (0, 0, '2019-11-11', 1);
  INSERT INTO lineorder VALUES (1, 1, '2020-11-12', 1);
  ```
- Use the aggregation functions `sumMerge` and `uniqMerge` to aggregate the materialized view and query the aggregated data.

  ```sql
  SELECT
      LO_ORDERDATE,
      sumMerge(SIGNS) AS SIGNS,
      uniqMerge(ORDERKEYS) AS ORDERKEYS
  FROM lineorder_view
  GROUP BY LO_ORDERDATE
  ORDER BY LO_ORDERDATE;
  ```

  Note: The functions `sum` and `uniq` can no longer be used here; otherwise the SQL fails with an error such as: Illegal type AggregateFunction(sum, Int8) of argument for aggregate function sum…

  The query results are as follows:
- In Conjunction with the Special Data Type AggregateFunction

- Create table `lineorder`.

  ```sql
  CREATE TABLE lineorder
  (
      LO_ORDERKEY UInt8,
      LO_ORDERDATE Date,
      LO_LINENUMBER UInt64
  )
  ENGINE = MergeTree()
  PARTITION BY toYYYYMM(LO_ORDERDATE)
  ORDER BY (LO_ORDERKEY, LO_ORDERDATE);
  ```
- Write data to the detail table `lineorder`.

  ```sql
  INSERT INTO lineorder VALUES (0, '2019-11-11', 1);
  INSERT INTO lineorder VALUES (1, '2020-11-12', 1);
  ```
- Create the aggregation table `lineorder_agg`, where the `LO_LINENUMBER` column's type is `AggregateFunction`.

  ```sql
  CREATE TABLE lineorder_agg
  (
      LO_ORDERKEY UInt8,
      LO_ORDERDATE Date,
      LO_LINENUMBER AggregateFunction(uniq, UInt64)
  )
  ENGINE = AggregatingMergeTree()
  PARTITION BY toYYYYMM(LO_ORDERDATE)
  ORDER BY (LO_ORDERKEY, LO_ORDERDATE);
  ```
- Use the aggregation function `uniqState` to insert data from the detail table into the aggregation table.

  ```sql
  INSERT INTO lineorder_agg
  SELECT LO_ORDERKEY, LO_ORDERDATE, uniqState(LO_LINENUMBER)
  FROM lineorder
  GROUP BY LO_ORDERKEY, LO_ORDERDATE;
  ```

  Note: You cannot use `INSERT INTO lineorder_agg VALUES(1, '2019-11-12', 1);` to insert data into the aggregation table; otherwise the SQL fails with an error such as: Cannot convert UInt64 to AggregateFunction(uniq, UInt64)…
Use the aggregation function
uniqMerge
for aggregating the aggregation table and querying the aggregate data.SELECT uniqMerge(LO_LINENUMBER) AS LINENUMBER FROM lineorder_agg GROUP BY LO_ORDERKEY, LO_ORDERDATE;
The query results are as follows:
Log
Lightweight engines with minimal functionality. This type of engine is best when you need to quickly write many small tables (up to about 1 million rows) and read them later as a whole.
This type of engine includes:
- TinyLog
- StripeLog
- Log
Integration Engines
These engines are for integrating with other data storage and processing systems.
This type of engine includes:
- Kafka
- MySQL
- ODBC
- JDBC
- HDFS
Engines for Other Specific Functions
This type of engine includes:
- Distributed
- MaterializedView
- Dictionary
- Merge
- File
- Null
- Set
- Join
- URL
- View
- Memory
- Buffer
Virtual Columns
- Virtual columns are part of the table engine definition and are defined in the corresponding table engine's source code.
- Virtual columns cannot be specified in CREATE TABLE and are not included in the results of SHOW CREATE TABLE or DESCRIBE TABLE queries. Virtual columns are read-only, so data cannot be written to them.
- To query the data in a virtual column, its name must be included explicitly in the SELECT query; `SELECT *` does not return data from virtual columns.
- If the created table has a column with the same name as a virtual column, the virtual column becomes inaccessible. To avoid such name conflicts, virtual column names are typically prefixed with an underscore.