Summary of ‘Meta Interview Question | System Design: Ad Click Counter & Aggregator’

This summary of the video was created by an AI. It might contain some inaccuracies.

00:00:00 – 01:20:19

The YouTube video discusses the ad click aggregator problem, focusing on system design and data storage for handling high volumes of ad clicks. Key points include functional and non-functional requirements, analytics related to page views, data storage and aggregation using platforms like Kafka and Cassandra, and strategies for handling hot partitioning issues. The speaker delves into potential problems with Kafka, the importance of partitioning keys, and different data processing architectures like Lambda and Kappa. Discussions also cover real-time data processing, updating records, and handling millions of write operations per minute. The video concludes with an invitation for viewers to engage in code competitions and ask questions in the Discord Channel.

00:00:00

In this part of the video, System Design Fight Club covers the ad click aggregator problem from Alex Xu’s second book. The speaker discusses functional and non-functional requirements, such as recording ad click events and querying them, similar to Google Analytics. Key figures include up to one billion ad clicks per day, two million unique ads, a latency budget of a few minutes for aggregates, and roughly 36.5 terabytes of data per year, retained for 10 years. Bandwidth calculations estimate an average of one megabyte per second, with peaks around five megabytes per second. Storage requirements are also discussed, indicating the need for many hard disks for long-term data retention.
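
The arithmetic behind these estimates is easy to verify. The sketch below assumes an average click event size of about 100 bytes, which is the figure that makes the quoted bandwidth and storage numbers line up; the event size itself is an assumption, not stated in the summary.

```python
# Back-of-envelope check of the capacity numbers quoted above.
CLICKS_PER_DAY = 1_000_000_000
EVENT_SIZE_BYTES = 100          # assumed average event size
SECONDS_PER_DAY = 24 * 60 * 60
PEAK_FACTOR = 5                 # peak traffic assumed ~5x average

avg_clicks_per_sec = CLICKS_PER_DAY / SECONDS_PER_DAY               # ~11,574/s
avg_bandwidth_mb_s = avg_clicks_per_sec * EVENT_SIZE_BYTES / 1e6    # ~1.16 MB/s
peak_bandwidth_mb_s = avg_bandwidth_mb_s * PEAK_FACTOR              # ~5.8 MB/s
storage_tb_per_year = CLICKS_PER_DAY * 365 * EVENT_SIZE_BYTES / 1e12  # 36.5 TB
storage_tb_10_years = storage_tb_per_year * 10                        # 365 TB

print(f"avg clicks/s:  {avg_clicks_per_sec:,.0f}")
print(f"bandwidth:     {avg_bandwidth_mb_s:.2f} MB/s avg, ~{peak_bandwidth_mb_s:.1f} MB/s peak")
print(f"storage:       {storage_tb_per_year:.1f} TB/year, {storage_tb_10_years:.0f} TB over 10 years")
```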

00:10:00

In this segment of the video, the speaker works through the analytics-side numbers: monthly and daily active analysts and their page views per day, which come out to roughly 10 page views per second for analysts, versus ad clicks peaking around 50,000 per second with an average of roughly 10,000 per second. The focus then shifts to diagramming the flow for loading static assets, server-side rendering, the ad placement service, the CDN, and the click capture service. The speaker emphasizes high availability for capturing ad click events (clicks drive revenue), avoiding double charges, and using Kafka as the ingestion layer for high-volume data. The request and response contract for capturing ad click data is also outlined, covering ad ID, user metadata, timestamp, and the redirect event.
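
A minimal sketch of what that capture contract might look like, based on the fields named above; the field names, URL store, and event log are illustrative stand-ins, not the video’s exact schema.

```python
# Hypothetical shape of the click-capture contract described above.
from dataclasses import dataclass, field
from uuid import uuid4
import time

@dataclass
class AdClickEvent:
    ad_id: str                       # which ad was clicked
    user_metadata: dict              # e.g. device, IP, user agent
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))
    event_id: str = field(default_factory=lambda: str(uuid4()))  # dedup key

AD_URLS = {"ad-123": "https://example.com/landing"}  # stand-in URL store
EVENT_LOG: list = []                                 # stand-in for Kafka

def capture_click(event: AdClickEvent) -> str:
    """Persist the click durably, then return the redirect target."""
    EVENT_LOG.append(event)          # real system: produce to Kafka
    return AD_URLS[event.ad_id]      # browser gets a 302 to this URL

print(capture_click(AdClickEvent("ad-123", {"device": "mobile"})))
```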

00:20:00

In this segment of the video, the speaker discusses data storage and aggregation. The data store will hold roughly 365 terabytes over 10 years. They walk through the schema, including fields for ad ID, click timestamp, metadata, UUID, and placement, emphasizing the UUID’s role in deduplication. The conversation then compares Kafka and SQS, touching on back pressure, ordering, and throughput. Finally, the speaker discusses using UUIDs to ensure uniqueness and idempotency so advertisers are not double-charged, noting the ongoing debate about achieving exactly-once semantics.
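
A sketch of how that might look on the producer side, assuming the confluent-kafka Python client (topic and broker names are placeholders). The idempotent producer setting only deduplicates the producer’s own retries; the UUID travels with the event so downstream consumers can deduplicate redeliveries, which is what the exactly-once discussion is about.

```python
# Producing click events with an idempotent Kafka producer.
import json
from uuid import uuid4
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "enable.idempotence": True,   # broker drops duplicate producer retries
    "acks": "all",                # don't ack until all replicas have it
})

def publish_click(ad_id: str, metadata: dict) -> None:
    event = {
        "event_id": str(uuid4()),  # consumer-side dedup key
        "ad_id": ad_id,
        "metadata": metadata,
    }
    # Keying by ad_id keeps one ad's clicks ordered within a partition.
    producer.produce("ad-clicks", key=ad_id, value=json.dumps(event))

publish_click("ad-123", {"device": "mobile"})
producer.flush()  # block until delivery (or error) is confirmed
```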

00:30:00

In this segment of the video, the speaker discusses partitioning strategies using Cassandra or a time series database, favoring partitioning by ad ID for the data locality it gives aggregation queries, despite the hot-partition risk that popular ads create. Cassandra’s consistent hashing is cited as one mitigation for hot partitions. The conversation also covers handling redirect events, keeping Kafka reliable for click capture, and managing the redirect URLs for ads; the speaker suggests Redis or DynamoDB for storing URLs and weighs the latency-versus-cost trade-offs. The segment concludes with handling click capture events and preserving availability during storage outages.
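
One common way to get both options the speaker mentions is a cache-aside lookup: try Redis first for low latency, fall back to DynamoDB, then warm the cache. A minimal sketch, assuming hypothetical table, key, and attribute names:

```python
# Cache-aside lookup for ad redirect URLs.
import boto3
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
ads_table = boto3.resource("dynamodb").Table("ads")  # hypothetical table

def get_redirect_url(ad_id: str):
    url = cache.get(f"ad_url:{ad_id}")
    if url is not None:
        return url                                      # cache hit: sub-ms
    item = ads_table.get_item(Key={"ad_id": ad_id}).get("Item")
    if item is None:
        return None                                     # unknown ad
    cache.set(f"ad_url:{ad_id}", item["url"], ex=3600)  # cache for 1 hour
    return item["url"]
```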

00:40:00

In this segment of the video, the speaker discusses a potential “noisy neighbor” problem in Kafka, where high-volume events from a large advertiser like Coca-Cola could delay processing for smaller companies. They touch on using a time series DB for easier querying and on how the choice of partitioning key (such as a UUID) affects load distribution. The speaker also mentions running MapReduce over Cassandra to process clicks, the distinction between Lambda and Kappa architectures for real-time event processing, and hints at exploring Flink for real-time processing. The conversation closes by explaining how databases sort and partition data to optimize processing.
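
The partition-key trade-off is easy to see with a toy simulation (the traffic mix below is invented for illustration): keying by ad ID sends one hot advertiser’s traffic to a single partition, while keying by a random UUID spreads the same load across all partitions, at the cost of losing per-ad ordering and locality.

```python
# Toy illustration of partition-key choice vs. load distribution.
from collections import Counter
from uuid import uuid4

PARTITIONS = 8
clicks = ["coca-cola"] * 9_000 + [f"ad-{i % 100}" for i in range(1_000)]

by_ad_id = Counter(hash(ad) % PARTITIONS for ad in clicks)
by_uuid = Counter(hash(str(uuid4())) % PARTITIONS for _ in clicks)

print("keyed by ad_id:", sorted(by_ad_id.values()))  # one partition gets ~9k
print("keyed by uuid: ", sorted(by_uuid.values()))   # roughly even, ~1,250 each
```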

00:50:00

In this segment of the video, the speaker addresses hot partitioning, which drives up latency on specific partitions, and explores consistent hashing plus a salted key (a random number appended to the ad ID) to distribute writes evenly. They also cover storing historical data efficiently without deletions via a tiered storage approach, and setting up a second data store holding aggregates at one-minute granularity, with the option to run MapReduce against it. They compare different schema setups, mention Lambda functions for trigger-based processing, and discuss strategies for handling straggler (late-arriving) events to keep the counts accurate.
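
A minimal sketch of the random-number-plus-ad-ID salting idea, with an in-memory dict standing in for the partitioned click store (the key format and salt count are assumptions): writes to a hot ad are spread over N salted keys, and the read side pays for it by fanning out over all N salts and merging.

```python
# Key salting: spread one hot ad's writes across N partition keys.
import random

NUM_SALTS = 10
store = {}  # stand-in for the partitioned click store

def record_click(ad_id: str) -> None:
    salt = random.randrange(NUM_SALTS)
    key = f"{ad_id}#{salt}"              # e.g. "coca-cola#7"
    store[key] = store.get(key, 0) + 1   # lands on 1 of N partitions

def total_clicks(ad_id: str) -> int:
    # Read side fans out over all N salts and merges.
    return sum(store.get(f"{ad_id}#{s}", 0) for s in range(NUM_SALTS))

for _ in range(100_000):
    record_click("coca-cola")
print(total_clicks("coca-cola"))  # 100000, spread over 10 keys
```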

01:00:00

In this segment of the video, the speaker covers handling updated records and the analytics side: the click count, back-end services for analysts, and an analytics service that retrieves the data. They discuss storing and processing real-time data with MapReduce and Spark Streaming, and various ways of combining historical and real-time data, including querying Flink directly. The speaker explains the per-ad-ID aggregation queries and suggests a pool of aggregation query threads rather than one query per ad ID, then discusses updating records in Cassandra while avoiding excessive reads against the click capture data store. Lastly, they weigh database choices for aggregated click storage, such as PostgreSQL or DynamoDB, against the analysts’ reads-per-second requirements.
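
The core of the aggregation, regardless of whether it runs in MapReduce, Spark Streaming, or Flink, is a tumbling one-minute window per ad ID. A minimal in-memory sketch of that roll-up (the tuple format is an assumption):

```python
# Roll raw clicks up into (ad_id, minute) buckets.
from collections import defaultdict

def minute_bucket(timestamp_ms: int) -> int:
    return timestamp_ms // 60_000  # truncate to the containing minute

def aggregate(clicks):
    counts = defaultdict(int)
    for ad_id, ts_ms in clicks:            # clicks: [(ad_id, timestamp_ms)]
        counts[(ad_id, minute_bucket(ts_ms))] += 1
    return dict(counts)

clicks = [("ad-1", 1_000), ("ad-1", 59_999), ("ad-1", 60_000), ("ad-2", 5)]
print(aggregate(clicks))
# {('ad-1', 0): 2, ('ad-1', 1): 1, ('ad-2', 0): 1}
```

The analytics store then sees one row per ad per minute instead of every raw click, which is what makes the read-side query load tractable.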

01:10:00

In this segment of the video, the speaker notes that recording every ad interaction could mean millions of write operations per minute, making sharding essential; they lean toward DynamoDB for this. They recommend starting the partitioning with ad ID for its data locality benefits, while acknowledging the read-latency cost of scatter-gather operations and the delays hot partitions can introduce. The speaker then covers the aggregation process and the challenge of many writers updating the same records, and compares running MapReduce jobs every minute against alternatives like Lambda functions or Flink/Spark Streaming for the aggregation work, expressing interest in exploring different technologies for efficient data aggregation.
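
One way to cut millions of per-click writes down to one write per (ad, minute), sketched below under assumed table and attribute names: the consumer pre-aggregates in memory, then flushes each bucket as a single atomic counter increment, which also sidesteps contention between writers updating the same record.

```python
# Pre-aggregate in memory, flush one atomic increment per bucket.
from collections import defaultdict
import boto3

table = boto3.resource("dynamodb").Table("ad_click_counts")  # hypothetical
buffer = defaultdict(int)

def on_click(ad_id: str, ts_ms: int) -> None:
    buffer[(ad_id, ts_ms // 60_000)] += 1   # aggregate in memory first

def flush() -> None:
    """Called once per minute; one write per (ad, minute) bucket."""
    for (ad_id, minute), count in buffer.items():
        table.update_item(
            Key={"ad_id": ad_id, "minute": minute},
            UpdateExpression="ADD clicks :n",  # atomic server-side add
            ExpressionAttributeValues={":n": count},
        )
    buffer.clear()
```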

01:20:00

In this closing part of the video, the speaker mentions that they have hosted code competitions regularly since January and encourages viewers to ask more questions in the Discord channel. They thank the audience for joining, with a special thanks to Michael.
