Wednesday, 10 January 2024

Benchmarking Write Performance in ClickHouse

This toolkit will measure how quickly ClickHouse can insert data into a table. We'll use Python's clickhouse-driver package to interact with ClickHouse and the time module to measure how long the insert operations take. Here's an outline of the script with explanations:

Key Components of the Script

  1. Setup Connection: Establish a connection to the ClickHouse server.
  2. Data Generation: Generate or read a dataset to insert into ClickHouse.
  3. Benchmark Function: A function that measures the time taken to insert data.
  4. Result Logging: Log the time taken for each insert operation.
  5. Repeat and Average: Repeat the insert operation multiple times to get an average performance metric.

Python Script for Benchmarking

import time
import random
from clickhouse_driver import Client

def generate_data(num_rows):
    # Generate sample data; here we use a simple schema for demonstration
    return [(random.randint(1, 1000), 'SampleString', random.random()) for _ in range(num_rows)]

def benchmark_insert_performance(client, data, batch_size):
    total_time = 0.0
    num_batches = len(data) // batch_size
    if num_batches == 0:
        raise ValueError("batch_size must not exceed the number of rows")

    for i in range(num_batches):
        # Slice out the next batch instead of repeatedly copying the list
        batch = data[i * batch_size:(i + 1) * batch_size]

        start_time = time.perf_counter()  # monotonic clock, preferred for timing
        client.execute('INSERT INTO test_table (id, name, value) VALUES', batch)
        end_time = time.perf_counter()

        batch_time = end_time - start_time
        total_time += batch_time
        print(f"Batch insert time: {batch_time:.4f} seconds")

    return total_time / num_batches

def main():
    # Parameters
    num_rows = 100000  # total number of rows to insert
    batch_size = 10000  # number of rows per batch

    # Setup ClickHouse connection
    client = Client('localhost')

    # Ensure you have the test_table created in ClickHouse with appropriate schema
    # Example: CREATE TABLE test_table (id Int32, name String, value Float64) ENGINE = MergeTree() ORDER BY id

    # Generate data
    data = generate_data(num_rows)

    # Benchmark
    avg_time = benchmark_insert_performance(client, data, batch_size)
    print(f"Average batch insert time: {avg_time:.4f} seconds")

if __name__ == "__main__":
    main()

Explanation of the Logic

  1. Data Generation: The generate_data function creates a list of tuples, each representing a row to be inserted into the ClickHouse database. The tuple layout and data types must match your ClickHouse table's schema.

  2. Batch Insertion: In the benchmark_insert_performance function, data is inserted in batches. Batch insertion is generally more efficient in ClickHouse compared to row-by-row insertion.

  3. Timing Measurement: For each batch insert, the time before and after the operation is recorded. The difference gives the time taken to insert that batch.

  4. Average Calculation: After inserting all batches, the script calculates the average time taken per batch. This gives a representative measure of sustained insert performance.

  5. Connection and Table: Before running the script, ensure that the ClickHouse server is accessible and the target table (test_table in this case) is created with the correct schema.

  6. Flexibility: You can adjust num_rows and batch_size to test different volumes of data and batch sizes.
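The average batch time converts directly into a throughput figure, which is often easier to compare across batch sizes: rows per second = batch_size / average batch time. A minimal sketch (the 0.05 s figure below is purely illustrative, not a measured result):

```python
def throughput(batch_size, avg_batch_time):
    """Rows inserted per second, derived from the average batch time."""
    return batch_size / avg_batch_time

# With the script's default 10,000-row batches, a hypothetical 0.05 s
# average batch time corresponds to 200,000 rows/second.
print(f"{throughput(10000, 0.05):,.0f} rows/second")
```

Reporting rows/second rather than raw batch times makes runs with different batch_size values directly comparable.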

Prerequisites

  • Install clickhouse-driver: You need to install this Python package to run the script (pip install clickhouse-driver).
  • ClickHouse Server: Ensure the ClickHouse server is running and accessible from where the script is executed.
  • Table Schema: Modify the table name and schema in the script to match your actual ClickHouse setup.
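If test_table does not exist yet, it can be created once up front before benchmarking. A minimal sketch, assuming the Int32/String/Float64 schema from the comment in the script; the create_test_table helper name is ours, and client is expected to be an already-connected clickhouse_driver.Client:

```python
# DDL matching the sample schema used by generate_data()
TEST_TABLE_DDL = (
    "CREATE TABLE IF NOT EXISTS test_table ("
    "id Int32, name String, value Float64"
    ") ENGINE = MergeTree() ORDER BY id"
)

def create_test_table(client):
    """Create test_table if absent; `client` is a connected
    clickhouse_driver.Client instance."""
    client.execute(TEST_TABLE_DDL)
```

Using CREATE TABLE IF NOT EXISTS makes the setup step safe to run repeatedly between benchmark runs.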

