Explore the logic behind building data cubes using dedicated servers

Evaluating the Approach: Offloading Data Cube Updates with Kafka and SQL Replica

The proposed approach to offload data cube updates from the production server, utilizing Kafka for asynchronous processing and reading data from an Always On SQL Replica, is highly effective for several reasons:

[Image: /files/dataCubeUpdate.png (data cube update architecture diagram)]

Strengths of the Approach

  1. Reduced Load on the Production Server:
    • Moving data cube updates off the darwin.Cloud production server ensures that real-time user operations (data insertion, updates, deletions) are not slowed down by resource-intensive cube calculations.
  2. Asynchronous Processing with Kafka:
    • Using Kafka for event-driven updates allows for a decoupled and scalable architecture. Deferring consumption of these events aligns well with the asynchronous nature of SQL Replicas and avoids relying on real-time updates, which are unnecessary for data cubes.
  3. Utilizing SQL Replica for Cube Updates:
    • By reading data from the Always On SQL Replica instead of the production server, you distribute the workload effectively, leveraging the Replica for read-intensive tasks like cube updates.
  4. Built-In Delay for Data Synchronization:
    • Introducing a delay in the Kafka processing ensures that the SQL Replica is fully synchronized before attempting to update the cube, reducing the risk of stale or incomplete data.
  5. Scalability and Fault Tolerance:
    • Kafka’s ability to handle large volumes of events and ensure delivery even in the case of failures enhances the reliability of this approach.

Potential Challenges and Mitigations

  1. Data Latency:
    • Delaying updates by 10 minutes means that the latest data changes will not be visible immediately, but this is acceptable given that the darwin data cubes do not require real-time updates.
  2. Complexity in Architecture:
    • Adding Kafka and a secondary processing server increases system complexity, which requires robust monitoring and maintenance.

Recommended Approach and Best Practices

1. Event Generation on the Production Server

  • Triggering Events:
    • Use application-level logic to send messages to a Kafka topic whenever a data change (INSERT, UPDATE, DELETE) occurs in the production database.
    • The Kafka message should include relevant metadata such as:
      • Table name.
      • Operation type (INSERT, UPDATE, DELETE).
      • Timestamp of the operation.
      • Primary key or identifiers for the updated rows.
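The event payload described above can be sketched as follows. `build_change_event` and the topic name are illustrative, and the actual send (shown with the kafka-python client) is commented out so the sketch stays self-contained:

```python
import json
from datetime import datetime, timezone

def build_change_event(table: str, operation: str, keys: list) -> str:
    """Serialize a data-change event carrying the metadata listed above:
    table name, operation type, timestamp, and primary keys."""
    if operation not in ("INSERT", "UPDATE", "DELETE"):
        raise ValueError(f"unsupported operation: {operation}")
    event = {
        "table": table,
        "operation": operation,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "keys": keys,
    }
    return json.dumps(event)

# Publishing with the kafka-python client (requires a running broker):
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="broker:9092")
# producer.send("data-cube-updates",
#               build_change_event("Orders", "UPDATE", [42]).encode("utf-8"))
```

Keeping the payload to identifiers rather than full row contents keeps messages small; the consumer re-reads the current row state from the Replica anyway.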

2. Kafka for Asynchronous Processing

  • Kafka Topic:
    • Create a Kafka topic specifically for data cube updates (e.g., data-cube-updates).
  • Delay Implementation:
    • Kafka has no built-in delayed-delivery mechanism, so implement the delay in the consumer: compare each message's timestamp with the current time and pause until the configured interval (e.g., 10 minutes) has elapsed. Consumer lag monitoring can then verify that the delay behaves as intended.
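A minimal sketch of the timestamp-based delay, assuming a 10-minute window; the function name is illustrative:

```python
import time

DELAY_SECONDS = 10 * 60  # 10-minute delay before a message becomes eligible

def seconds_until_ready(message_timestamp_ms: int, now: float = None) -> float:
    """Return how long the consumer should still wait for this message.

    Kafka message timestamps are epoch milliseconds; a non-positive
    result means the message is old enough to process now.
    """
    if now is None:
        now = time.time()
    eligible_at = message_timestamp_ms / 1000.0 + DELAY_SECONDS
    return eligible_at - now

# In the consumer loop, sleep for max(0, seconds_until_ready(msg.timestamp))
# before handling each message. Because messages within a partition are
# ordered by time, waiting on the head message is sufficient.
```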

3. Processing on a Separate Server

  • Deploy a Kafka consumer application on a separate server dedicated to handling data cube updates.
  • The consumer application reads messages from the data-cube-updates topic after the configured delay and retrieves the relevant data from the SQL Replica.

4. Reading from SQL Replica

  • Configure the consumer application to query the Always On SQL Replica for updated data.
  • Ensure the SQL Replica is optimized for read operations and can handle the cube update workload without affecting its performance.
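Routing reads to the Replica can use the Always On listener's read-only routing, as sketched below. `ApplicationIntent=ReadOnly` is a real SQL Server connection option; the listener and database names, helper functions, and the pyodbc usage are illustrative assumptions:

```python
def replica_connection_string(listener: str, database: str) -> str:
    """ODBC connection string routed to a readable secondary.

    ApplicationIntent=ReadOnly makes the Always On listener redirect
    the connection to a readable replica instead of the primary.
    """
    return (
        "Driver={ODBC Driver 18 for SQL Server};"
        f"Server={listener};Database={database};"
        "ApplicationIntent=ReadOnly;Encrypt=yes;"
    )

def fetch_rows_sql(table: str, key_column: str, key_count: int) -> str:
    """Parameterized SELECT for the rows named in a batch of events.

    `table` and `key_column` must come from a trusted whitelist
    (they are interpolated, not parameterized)."""
    placeholders = ", ".join("?" * key_count)
    return f"SELECT * FROM {table} WHERE {key_column} IN ({placeholders})"

# With pyodbc (not run here; names are placeholders):
# import pyodbc
# conn = pyodbc.connect(replica_connection_string("ag-listener", "darwin"))
# rows = conn.execute(fetch_rows_sql("Orders", "OrderId", 2), [42, 43]).fetchall()
```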

5. Updating the Data Cube

  • Process the retrieved data on the secondary server to:
    • Insert new records into the cube.
    • Update modified records in the cube.
    • Delete outdated records from the cube.
  • Use efficient batch processing methods to minimize the time required to update the cube.
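One way to batch the work is to group events by table and operation before touching the cube, so each group becomes a single bulk statement rather than row-by-row calls. This sketch assumes at most one event per key within a processing window (otherwise per-key ordering must be resolved first); the function name is illustrative:

```python
from collections import defaultdict

def partition_events(events):
    """Group change events by (table, operation) so each cube update
    can be applied as one batched insert/update/delete."""
    batches = defaultdict(list)
    for event in events:
        batches[(event["table"], event["operation"])].extend(event["keys"])
    return dict(batches)
```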

Final Workflow Summary

  1. On the Production Server:
    • Data change events (INSERT, UPDATE, DELETE) trigger a Kafka message.
  2. Kafka:
    • Messages are queued; the consumer defers processing until the configured delay has elapsed.
  3. On the Replica Server:
    • A Kafka consumer reads delayed messages, retrieves the relevant data from the SQL Replica, and processes cube updates.
  4. Data Cube Server:
    • Updates to the data cube are performed using the data retrieved from the Replica.
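The workflow above can be condensed into a single consumer-side cycle. Every name here is a hypothetical stand-in for the components described in this document; the replica reader and cube writer are injected so the loop stays independent of the concrete SQL and cube technologies:

```python
def run_cube_update_cycle(messages, fetch_from_replica, apply_to_cube):
    """One processing cycle: take messages whose delay has elapsed,
    read fresh row data from the SQL Replica, and apply it to the cube.

    `fetch_from_replica(table, keys)` and `apply_to_cube(table, op, rows)`
    are placeholders for the replica query and cube update steps."""
    applied = []
    for msg in messages:
        rows = fetch_from_replica(msg["table"], msg["keys"])
        apply_to_cube(msg["table"], msg["operation"], rows)
        applied.append(msg["operation"])
    return applied
```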