
Wednesday, 10 January 2024

Benchmarking Write Performance in ClickHouse

 This toolkit measures how quickly ClickHouse can insert data into a table. We'll use Python's clickhouse-driver package to interact with ClickHouse and the time module to measure how long insert operations take. Here's an outline of the script with explanations:

Key Components of the Script

  1. Setup Connection: Establish a connection to the ClickHouse server.
  2. Data Generation: Generate or read a dataset to insert into ClickHouse.
  3. Benchmark Function: A function that measures the time taken to insert data.
  4. Result Logging: Log the time taken for each insert operation.
  5. Repeat and Average: Repeat the insert operation multiple times to get an average performance metric.

Python Script for Benchmarking

import time
import random
from clickhouse_driver import Client

def generate_data(num_rows):
    # Generate sample data; here we use a simple schema for demonstration
    return [(random.randint(1, 1000), 'SampleString', random.random()) for _ in range(num_rows)]

def benchmark_insert_performance(client, data, batch_size):
    total_time = 0.0
    num_batches = len(data) // batch_size

    for i in range(num_batches):
        # Slice out the current batch by index instead of repeatedly
        # copying the remaining data
        batch = data[i * batch_size:(i + 1) * batch_size]

        # perf_counter is a monotonic clock, better suited to measuring
        # intervals than time.time
        start_time = time.perf_counter()
        client.execute('INSERT INTO test_table VALUES', batch)
        batch_time = time.perf_counter() - start_time

        total_time += batch_time
        print(f"Batch insert time: {batch_time:.4f} seconds")

    avg_time = total_time / num_batches
    return avg_time

def main():
    # Parameters
    num_rows = 100000  # total number of rows to insert
    batch_size = 10000  # number of rows per batch

    # Setup ClickHouse connection
    client = Client('localhost')

    # Ensure you have the test_table created in ClickHouse with appropriate schema
    # Example: CREATE TABLE test_table (id Int32, name String, value Float64) ENGINE = MergeTree() ORDER BY id

    # Generate data
    data = generate_data(num_rows)

    # Benchmark
    avg_time = benchmark_insert_performance(client, data, batch_size)
    print(f"Average batch insert time: {avg_time:.4f} seconds")

if __name__ == "__main__":
    main()
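
A note on timekeeping: for interval measurement, time.perf_counter() is preferable to time.time() because it is monotonic and has higher resolution. As a minimal sketch, a small hypothetical helper (timed is not part of the script above) can wrap any call and return both its result and elapsed time:

```python
import time

def timed(fn, *args, **kwargs):
    # Time a single call with a monotonic, high-resolution clock.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Example with a trivial workload instead of a real insert
result, elapsed = timed(sum, range(1000))
print(f"took {elapsed:.6f} s")
```

The same pattern could wrap the client.execute call in the benchmark loop, keeping the timing logic in one place.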

Explanation of the Logic

  1. Data Generation: The generate_data function creates a list of tuples, each representing a row to be inserted into the ClickHouse database. The structure and data types of each tuple should match your ClickHouse table's schema.

  2. Batch Insertion: In the benchmark_insert_performance function, data is inserted in batches. Batch insertion is generally more efficient in ClickHouse compared to row-by-row insertion.

  3. Timing Measurement: For each batch insert, the time before and after the operation is recorded. The difference gives the time taken to insert that batch.

  4. Average Calculation: After inserting all batches, the script calculates the average time taken per batch. This gives a fair idea of the insert performance.

  5. Connection and Table: Before running the script, ensure that the ClickHouse server is accessible and the target table (test_table in this case) is created with the correct schema.

  6. Flexibility: You can adjust num_rows and batch_size to test different volumes of data and batch sizes.
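
Once you have the average batch time, it is often more useful to report throughput in rows per second. A minimal sketch of that conversion (rows_per_second is a hypothetical helper, not part of the script above):

```python
def rows_per_second(batch_size, avg_batch_time):
    # Convert average batch latency into an approximate insert throughput.
    if avg_batch_time <= 0:
        raise ValueError("avg_batch_time must be positive")
    return batch_size / avg_batch_time

# Example: 10,000-row batches averaging 0.25 s each
print(f"{rows_per_second(10000, 0.25):,.0f} rows/sec")  # 40,000 rows/sec
```

Comparing rows/sec across different batch_size values gives a clearer picture than raw batch times, since larger batches naturally take longer per insert.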

Prerequisites

  • Install clickhouse-driver: You need to install this Python package to run the script (pip install clickhouse-driver).
  • ClickHouse Server: Ensure the ClickHouse server is running and accessible from where the script is executed.
  • Table Schema: Modify the table name and schema in the script to match your actual ClickHouse setup.
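
To avoid running the benchmark against a missing table, you can create it from the script itself. A hedged sketch, assuming the simple schema used above (ensure_test_table is a hypothetical helper; the execute call requires a live ClickHouse connection):

```python
def ensure_test_table(client=None, table='test_table'):
    # DDL matching the schema the benchmark inserts:
    # (id Int32, name String, value Float64)
    ddl = (
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(id Int32, name String, value Float64) "
        "ENGINE = MergeTree() ORDER BY id"
    )
    if client is not None:
        client.execute(ddl)  # only runs against a reachable server
    return ddl
```

Calling ensure_test_table(client) once in main(), before generate_data, makes the script self-contained for repeated runs.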
