Query (SELECT)

Basic Syntax:


SELECT [DISTINCT] expr_list
    [FROM [db.]table | (subquery) | table_function] [FINAL]
    [SAMPLE sample_coeff]
    [ARRAY JOIN ...]
    [GLOBAL] ANY|ALL INNER|LEFT JOIN (subquery)|table USING columns_list
    [PREWHERE expr]
    [WHERE expr]
    [GROUP BY expr_list] [WITH TOTALS]
    [HAVING expr]
    [ORDER BY expr_list]
    [LIMIT [n, ]m]
    [UNION ALL ...]
    [INTO OUTFILE filename]
    [FORMAT format]
    [LIMIT n BY columns]

All clauses are optional except for the expression list (expr_list) after SELECT. Partial explanations of some clauses are provided below. For more details, please refer to the query syntax in the ClickHouse official documentation.

Example of a simple query statement:


SELECT SUM(LO_REVENUE) AS REVENUE
FROM lineorder
WHERE ((LO_DISCOUNT >= 1) AND (LO_DISCOUNT <= 3)) AND (LO_QUANTITY < 25)

SAMPLE Clause

The SAMPLE clause allows users to perform approximate query processing, which only works on tables of the MergeTree* type and requires a sampling expression to be specified when creating the table.

The SAMPLE clause can be expressed using SAMPLE k, where k can be a decimal value between 0 and 1, or a sufficiently large positive integer.

When k is a decimal between 0 and 1, the query uses ‘k’ as a percentage to select data. For example, a SAMPLE 0.1 query will only retrieve 10% of the total data. When k is a sufficiently large positive integer, the query uses ‘k’ as the maximum number of samples. For instance, a SAMPLE 10000000 query will only retrieve up to 10,000,000 rows of data.

Example:


SELECT
    Title,
    count() * 10 AS PageViews
FROM hits_distributed
SAMPLE 0.1
WHERE
    CounterID = 34
    AND toDate(EventDate) >= toDate('2013-01-29')
    AND toDate(EventDate) <= toDate('2013-02-04')
    AND NOT DontCountHits
    AND NOT Refresh
    AND Title != ''
GROUP BY Title
ORDER BY PageViews DESC LIMIT 1000;

In the example above, the query retrieves 0.1 (10%) of the total data. It is worth noting that the query does not automatically adjust the aggregation function’s final results, so to get a more accurate result, you need to manually multiply the result of count() by 10.

When conducting approximate queries using something like SAMPLE 10000000, this approach is not suitable for scenarios where there is no information about which data is processed or by which factor the aggregation function should be multiplied.

Results obtained with the same sampling rate are always consistent: if we can see all the possible data that exists in the table, then the same sampling rate will always yield the same result (when using the same sampling expression during table creation). In other words, the system samples data in the same way regardless of time, server, or table.

For example, we can get a list of user IDs that is the same as that obtained without sampling using the sampling method. This means you can use sampling in an IN subquery, or use the sampled results to join with other queries.

ARRAY JOIN Clause

The ARRAY JOIN clause helps query and join with arrays and nested data types. It’s somewhat similar to the arrayJoin function but offers broader functionality.

ARRAY JOIN is essentially equivalent to INNER JOIN arrays. For example:


CREATE TABLE ck_array(id UInt8,scores Array(UInt8)) ENGINE = Memory;


INSERT INTO ck_array VALUES (1,[89,91]),(2,[77,67]),(3,[79,89])


select * from ck_array;


SELECT id, scores FROM ck_array ARRAY JOIN scores

Handling Null in Join Statements

Please refer to join_use_nulls , Nullable , and NULL .

WHERE Clause

If a WHERE clause exists, it must contain a UInt8 type expression. This expression is usually one with comparisons and logic. This expression will be used to filter data before all data transformations.

If this expression exists in a database table engine that supports indexing, it will be evaluated for index usage.

PREWHERE Clause

This clause has the same meaning as the WHERE clause. The main difference is in table data reading. When PREWHERE is used, only the columns needed by the PREWHERE expression are read initially. Then, based on the result of executing the PREWHERE, other necessary columns are read.

If there are a small number of columns not suitable for index filtering in the filter conditions but can provide strong filtering capabilities, using PREWHERE is meaningful as it helps reduce data reading.

For example, writing PREWHERE for a few columns in a query that needs to extract many columns is very effective.

Tip: PREWHERE is supported only in *MergeTree series engines. You can specify both PREWHERE and WHERE in a query, in which case, PREWHERE is executed before WHERE. PREWHERE is not suitable for columns already present in the index, as only data blocks that satisfy the index will be read when columns already exist in the index. If ‘optimize_move_to_prewhere’ is set to 1 and PREWHERE is not included in the query, the system will automatically extract parts suitable for PREWHERE expressions from WHERE to PREWHERE.

WITH TOTALS Modifier

If you specify the WITH TOTALS modifier, you will get an additional calculated row in the result. In this row, all key default values (zero or null values) will be included, as well as the aggregation results of all aggregation functions for all selected data rows.

This row is output separately from other rows only in JSON*, TabSeparated*, Pretty* output formats.