Working with CSV and TSV data in ClickHouse
ClickHouse supports importing data from and exporting to CSV. Since CSV files can come with different format specifics, including header rows, custom delimiters, and escape symbols, ClickHouse provides formats and settings to address each case efficiently.
Importing data from a CSV file
Before importing data, let’s create a table with a relevant structure:
CREATE TABLE sometable
(
`path` String,
`month` Date,
`hits` UInt32
)
ENGINE = MergeTree
ORDER BY tuple(month, path)
To import data from the CSV file to the sometable
table, we can pipe our file directly to the clickhouse-client:
clickhouse-client -q "INSERT INTO sometable FORMAT CSV" < data_small.csv
Note that we use FORMAT CSV to let ClickHouse know we’re ingesting CSV formatted data. Alternatively, we can load data from a local file using the FROM INFILE clause:
INSERT INTO sometable
FROM INFILE 'data_small.csv'
FORMAT CSV
Here, we use the FORMAT CSV
clause so ClickHouse understands the file format. We can also load data directly from URLs using url() function or from S3 files using s3() function.
We can skip explicit format setting for file()
and INFILE
/OUTFILE
.
In that case, ClickHouse will automatically detect format based on file extension.
CSV files with headers
Suppose our CSV file has headers in it:
head data-small-headers.csv
"path","month","hits"
"Akiba_Hebrew_Academy","2017-08-01",241
"Aegithina_tiphia","2018-02-01",34
To import data from this file, we can use CSVWithNames format:
clickhouse-client -q "INSERT INTO sometable FORMAT CSVWithNames" < data_small_headers.csv
In this case, ClickHouse skips the first row while importing data from the file.
Starting from 23.1 version ClickHouse will automatically detect headers in CSV files when CSV
type is used, so no need to use CSVWithNames
or CSVWithNamesAndTypes
.
CSV files with custom delimiters
In case the CSV file uses other than comma delimiter, we can use the format_csv_delimiter option to set the relevant symbol:
SET format_csv_delimiter = ';'
Now, when we import from a CSV file, ;
symbol is going to be used as a delimiter instead of a comma.
Skipping lines in a CSV file
Sometimes, we might skip a certain number of lines while importing data from a CSV file. This can be done using input_format_csv_skip_first_lines option:
SET input_format_csv_skip_first_lines = 10
In this case, we’re going to skip the first ten lines from the CSV file:
SELECT count(*) FROM file('data-small.csv', CSV)
┌─count()─┐
│ 990 │
└─────────┘
The file has 1k rows, but ClickHouse loaded only 990 since we’ve asked to skip the first 10.
When using the file()
function, with ClickHouse Cloud you will need to run the commands in clickhouse client
on the machine where the file resides. Another option is to use clickhouse-local
to explore files locally.
Treating NULL values in CSV files
Null values can be encoded differently depending on the application that generated the file. By default, ClickHouse uses \N
as a Null value in CSV. But we can change that using the format_csv_null_representation option.
Suppose we have the following CSV file:
> cat nulls.csv
Donald,90
Joe,Nothing
Nothing,70
If we load data from this file, ClickHouse will treat Nothing
as a String (which is correct):
SELECT * FROM file('nulls.csv')
┌─c1──────┬─c2──────┐
│ Donald │ 90 │
│ Joe │ Nothing │
│ Nothing │ 70 │
└─────────┴─────────┘
If we want ClickHouse to treat Nothing
as NULL
, we can define that using the following option:
SET format_csv_null_representation = 'Nothing'
Now we have NULL
where we expect it to be:
SELECT * FROM file('nulls.csv')
┌─c1─────┬─c2───┐
│ Donald │ 90 │
│ Joe │ ᴺᵁᴸᴸ │
│ ᴺᵁᴸᴸ │ 70 │
└────────┴──────┘
TSV (Tab-separated) files
Tab-separated data format is widely used as a data interchange format. To load data from a TSV file to ClickHouse, the TabSeparated format is used:
clickhouse-client -q "INSERT INTO sometable FORMAT TabSeparated" < data_small.tsv
There’s also a TabSeparatedWithNames format to allow working with TSV files that have headers. And, like for CSV, we can skip the first X lines using the input_format_tsv_skip_first_lines option.
Raw TSV
Sometimes, TSV files are saved without escaping tabs and line breaks. We should use TabSeparatedRaw to handle such files.
Exporting to CSV
Any format in our previous examples can also be used to export data. To export data from a table (or a query) to a CSV format, we use the same FORMAT
clause:
SELECT *
FROM sometable
LIMIT 5
FORMAT CSV
"Akiba_Hebrew_Academy","2017-08-01",241
"Aegithina_tiphia","2018-02-01",34
"1971-72_Utah_Stars_season","2016-10-01",1
"2015_UEFA_European_Under-21_Championship_qualification_Group_8","2015-12-01",73
"2016_Greater_Western_Sydney_Giants_season","2017-05-01",86
To add a header to the CSV file, we use the CSVWithNames format:
SELECT *
FROM sometable
LIMIT 5
FORMAT CSVWithNames
"path","month","hits"
"Akiba_Hebrew_Academy","2017-08-01",241
"Aegithina_tiphia","2018-02-01",34
"1971-72_Utah_Stars_season","2016-10-01",1
"2015_UEFA_European_Under-21_Championship_qualification_Group_8","2015-12-01",73
"2016_Greater_Western_Sydney_Giants_season","2017-05-01",86
Saving exported data to a CSV file
To save exported data to a file, we can use the INTO…OUTFILE clause:
SELECT *
FROM sometable
INTO OUTFILE 'out.csv'
FORMAT CSVWithNames
36838935 rows in set. Elapsed: 1.304 sec. Processed 36.84 million rows, 1.42 GB (28.24 million rows/s., 1.09 GB/s.)
Note how it took ClickHouse ~1 second to save 36m rows to a CSV file.
Exporting CSV with custom delimiters
If we want to have other than comma delimiters, we can use the format_csv_delimiter settings option for that:
SET format_csv_delimiter = '|'
Now ClickHouse will use |
as a delimiter for CSV format:
SELECT *
FROM sometable
LIMIT 5
FORMAT CSV
"Akiba_Hebrew_Academy"|"2017-08-01"|241
"Aegithina_tiphia"|"2018-02-01"|34
"1971-72_Utah_Stars_season"|"2016-10-01"|1
"2015_UEFA_European_Under-21_Championship_qualification_Group_8"|"2015-12-01"|73
"2016_Greater_Western_Sydney_Giants_season"|"2017-05-01"|86
Exporting CSV for Windows
If we want a CSV file to work fine in a Windows environment, we should consider enabling output_format_csv_crlf_end_of_line option. This will use \r\n
as a line breaks instead of \n
:
SET output_format_csv_crlf_end_of_line = 1;
Schema inference for CSV files
We might work with unknown CSV files in many cases, so we have to explore which types to use for columns. Clickhouse, by default, will try to guess data formats based on its analysis of a given CSV file. This is known as "Schema Inference". Detected data types can be explored using the DESCRIBE
statement in pair with the file() function:
DESCRIBE file('data-small.csv', CSV)
┌─name─┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Nullable(String) │ │ │ │ │ │
│ c2 │ Nullable(Date) │ │ │ │ │ │
│ c3 │ Nullable(Int64) │ │ │ │ │ │
└──────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
Here, ClickHouse could guess column types for our CSV file efficiently. If we don’t want ClickHouse to guess, we can disable this with the following option:
SET input_format_csv_use_best_effort_in_schema_inference = 0
All column types will be treated as a String
in this case.
Exporting and importing CSV with explicit column types
ClickHouse also allows explicitly setting column types when exporting data using CSVWithNamesAndTypes (and other *WithNames formats family):
SELECT *
FROM sometable
LIMIT 5
FORMAT CSVWithNamesAndTypes
"path","month","hits"
"String","Date","UInt32"
"Akiba_Hebrew_Academy","2017-08-01",241
"Aegithina_tiphia","2018-02-01",34
"1971-72_Utah_Stars_season","2016-10-01",1
"2015_UEFA_European_Under-21_Championship_qualification_Group_8","2015-12-01",73
"2016_Greater_Western_Sydney_Giants_season","2017-05-01",86
This format will include two header rows - one with column names and the other with column types. This will allow ClickHouse (and other apps) to identify column types when loading data from such files:
DESCRIBE file('data_csv_types.csv', CSVWithNamesAndTypes)
┌─name──┬─type───┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ path │ String │ │ │ │ │ │
│ month │ Date │ │ │ │ │ │
│ hits │ UInt32 │ │ │ │ │ │
└───────┴────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
Now ClickHouse identifies column types based on a (second) header row instead of guessing.
Custom delimiters, separators, and escaping rules
In sophisticated cases, text data can be formatted in a highly custom manner but still have a structure. ClickHouse has a special CustomSeparated format for such cases, which allows setting custom escaping rules, delimiters, line separators, and starting/ending symbols.
Suppose we have the following data in the file:
row('Akiba_Hebrew_Academy';'2017-08-01';241),row('Aegithina_tiphia';'2018-02-01';34),...
We can see that individual rows are wrapped in row()
, lines are separated with ,
and individual values are delimited with ;
. In this case, we can use the following settings to read data from this file:
SET format_custom_row_before_delimiter = 'row(';
SET format_custom_row_after_delimiter = ')';
SET format_custom_field_delimiter = ';';
SET format_custom_row_between_delimiter = ',';
SET format_custom_escaping_rule = 'Quoted';
Now we can load data from our custom formatted file:
SELECT *
FROM file('data_small_custom.txt', CustomSeparated)
LIMIT 3
┌─c1────────────────────────┬─────────c2─┬──c3─┐
│ Akiba_Hebrew_Academy │ 2017-08-01 │ 241 │
│ Aegithina_tiphia │ 2018-02-01 │ 34 │
│ 1971-72_Utah_Stars_season │ 2016-10-01 │ 1 │
└───────────────────────────┴────────────┴─────┘
We can also use CustomSeparatedWithNames to get headers exported and imported correctly. Explore regex and template formats to deal with even more complex cases.
Working with large CSV files
CSV files can be large, and ClickHouse works efficiently with files of any size. Large files usually come compressed, and ClickHouse covers this with no need for decompression before processing. We can use a COMPRESSION
clause during an insert:
INSERT INTO sometable
FROM INFILE 'data_csv.csv.gz'
COMPRESSION 'gzip' FORMAT CSV
If a COMPRESSION
clause is omitted, ClickHouse will still try to guess file compression based on its extension. The same approach can be used to export files directly to compressed formats:
SELECT *
FROM for_csv
INTO OUTFILE 'data_csv.csv.gz'
COMPRESSION 'gzip' FORMAT CSV
This will create a compressed data_csv.csv.gz
file.
Other formats
ClickHouse introduces support for many formats, both text, and binary, to cover various scenarios and platforms. Explore more formats and ways to work with them in the following articles:
- CSV and TSV formats
- Parquet
- JSON formats
- Regex and templates
- Native and binary formats
- SQL formats
And also check clickhouse-local - a portable full-featured tool to work on local/remote files without the need for Clickhouse server.