Q: What's the functional requirements and non-functional requirements? Advanced requirements if we have enough time
A: Functional Requirements:
- Users should be able to view events
- Users should be able to search events
- Users should be able to book tickets to events Non-functional requirements:
- The system should prioritize the availability for search & view events, but should prioritize the consistency for booking events
- The system should be scalable and able to handle high throughput in a form of popular events. (10 million users)
- The system should have low latency search
- The system is read heavy, and thus need to be able to support high read throughput 100:1
Q: How to define core entites
A:
- Event: date, description, performer or team
- Ticket
- Booking
- User
- Performer:
- Venue:
Q: How to define api or system interface
A
- GET /events/search -> event[]
- GET /events/:eventId -> event
- POST /booking/:eventId -> bookingId
Q: 从/events/:eventId 接口开始设计,画一个系统架构图。 Draw a high level design
先设计一个简单系统架构,client->api gateway->server->database
Q: 如何设计GET /events/search接口?
A: 将search server和 event server分开,因为search的读操作会非常大,独立出来方便扩展,后续还会接入elastic search。 所以api gateway 会将/event/search 接口指向 search server。 search server 直接从数据库读取数据库。
Q: 如何设计Booking ?
A: 由于我们的预定系统有数据一致性的要求,所以我们需要选择支持事务的数据库,比如mysql, postgresql, dynamodb Additionally, we need to implement proper isolation levels and either row-level locking or Optimistic Concurrency Control (OCC) to fully prevent double bookings.
The main thing we are trying to avoid is two (or more) users paying for the same ticket. That would make for an awkward situation at the event! To handle this consistency issue, we need to select a database that supports transactions. This will allow us to ensure that only one user can book a ticket at a time.
While anything from MySQL to DynamoDB would be fine choices (just needs ACID properties), we'll opt for PostgreSQL. Additionally, we need to implement proper isolation levels and either row-level locking or Optimistic Concurrency Control (OCC) to fully prevent double bookings.
-
New Tables in Events DB: First we add two new tables to our database, Bookings and Tickets. The Bookings table will store the details of each booking, including the user ID, ticket IDs, total price, and booking status. The Tickets table will store the details of each ticket, including the event ID, seat details, pricing, and status. The Tickets table will also have a bookingId column that links it to the Bookings table.
-
Booking Service: This microservice is responsible for the core functionality of the ticket booking process. It interacts with databases that store data on bookings and tickets.
-
It interfaces with the Payment Processor (Stripe) for transactions. Once a payment is confirmed, the booking service updates the ticket status to "sold".
-
It communicates with the Bookings and Tickets tables to fetch, update, or store relevant data.
-
-
Payment Processor (Stripe): An external service responsible for handling payment transactions. Once a payment is processed, it notifies the booking service of the transaction status.
When a user goes to book a ticket, the following happens:
- The user is redirected to a booking page where they can provide their payment details and confirm the booking.
- Upon confirmation, a POST request is sent to the /bookings endpoint with the selected ticket IDs.
- The booking server initiates a transaction to:
- Check the availability of the selected tickets.
- Update the status of the selected tickets to “booked”.
- Create a new booking record in the Bookings table.
- If the transaction is successful, the booking server returns a success response to the client. Otherwise, if the transaction failed because another user booked the ticket in the meantime, the server returns a failure response and we pass this information back to the client.
Note, this means that when a new event is created we need to create a new ticket for for each seat in the venue. Each of which will be available for purchase until it is booked.
You may have noticed there is a fundamental issue with this design. Users can get to the booking page, type in their payment details, and then find out that the ticket they wanted is no longer available. This would suck and is something that we are going to discuss how to avoid later on in our deep dives. For now, we have a simple implementation that meets the functional requirement.
Q: 如何改进预定票的用户体验?
A: 之前的设计,会有体验问题,假设两个用户同时选择了一个位置,进入到了预定页面,第一个将支付信息输入完成并且提交预定的用户将成功预定,而第二个用户将预约失败,如何改进? 我们可以新增一个中间状态,reserving, 帮助用户支付前锁定门票,但是我们要确保用户不想要支付了,可以自动解锁,或者用户支付成功了将订单状态转换为支付成功
有几种技术方式实现:
- 悲观锁
- cron定时任务。 添加status field and expiration time on the ticket table.
- 隐式状态,包含状态和到期时间. 不用到cron
- Distributed lock with ttl
Now, when a user wants to book a ticket:
- A user will select a seat from the interactive seat map. This will trigger a POST /bookings with the ticketId associated with that seat.
- The request will be forwarded from our API gateway onto the Booking Service.
- The Booking Service will lock that ticket by adding it to our Redis Distributed Lock with a TTL of 10 minutes (this is how long we will hold the ticket for).
- The Booking Service will also write a new booking entry in the DB with a status of in-progress.
- We will then respond to the user with their newly created bookingId and route the client to a the payment page.
- If the user stops here, then after 10 minutes the lock is auto-released and the ticket is available for another user to purchase.
- The user will fill out their payment details and click “Purchase.” In doing so, the payment (along with the bookingId) gets sent to Stripe for processing and Stripe responds via webhook that the payment was successful.
- Upon successful payment confirmation from Stripe, our system's webhook retrieves the bookingId embedded within the Stripe metadata. With this bookingId, the webhook initiates a database transaction to concurrently update the Ticket and Booking tables. Specifically, the status of the ticket linked to the booking is changed to "sold" in the Ticket table. Simultaneously, the corresponding booking entry in the Booking table is marked as "confirmed."
- Now the ticket is booked!
Good Solution: Status & Expiration Time with Cron
Approach
A better solution is to lock the ticket by adding a status field and expiration time on the ticket table. The ticket can then be in 1 of 3 states: available, reserved, booking. This allows us to track the status of each ticket and automatically release the lock once the expiration time is reached. When a user selects a ticket, the status changes from "available" to "reserved", and the current timestamp is recorded.
Now lets think about how we handle unlocking with this approach:
-
If the user finalizes the purchase, the status changes to "booked", and the lock is released.
-
If the user takes too long or abandons the purchase, the status changes back to "available" once the expiration time is reached, and the lock is released. The tricky part here is how we handle the expiration time. We could use a cron job to periodically query for rows with "reserved" status where the time elapsed exceeds the lock duration and then set them back to "available". This is much better, but the lag between the elapsed time and the time the row is changed is not ideal for popular events. Ideally, the lock should be removed almost exactly after expiration.
Challenges
Whether we use the cron job or on-demand approach, they both have significant drawbacks:
Cron Job Approach:
-
Delay in Unlocking: There's an inherent delay between the ticket expiration and the cron job execution, leading to inefficiencies, particularly for high-demand events. Tickets might remain unavailable for purchase even after the expiration time, reducing booking opportunities.
-
Reliability Issues: If the cron job experiences failures or delays, it can cause significant disruptions in the ticket booking process, leading to customer dissatisfaction and potential revenue loss.
Great Solution: Implicit Status with Status and Expiration Time
Approach
We can do even better than our cron-based solution by recognizing the the status of any given ticket is the combination of two attributes: whether it's available OR whether it's been reserved but the reservation has expired. Rather than using long-running interactive transactions or locks, we can create short transactions to update the fields on the ticket record (e.g. changing "available" to "reserved" and setting expiration to +10 minutes). Inside these transactions we can confirm the ticket is available before reserving it or that the expiration on the previous reservation has passed.
So, in pseudocode, our transaction looks like this:
-
We begin a transaction
-
We check to see if the current ticket is either AVAILABLE or (RESERVED but expired)
-
We update the ticket to RESERVED with expiration now + 10 minutes.
-
We commit the transaction.
This guarantees only one user will be able to reserve or book a ticket and that users will be able to claim tickets from expired reservations.
Challenges
Our read operations are going to be be slightly slower by needing to filter on two values. We can partly solve this by utilizing materialized views or other features of modern RDBMS's together with a compound index. Our database table is also less legible for other consumers of the data, since some reservations are actually expired. We can solve that problem by utilizing a cron or periodic sweep job like above, with the very important difference that our system behavior will not be effected if that sweep is delayed.
Great Solution: Distributed Lock with TTL
Approach
Another great solution is to implement a distributed lock (opens in a new tab) with a TTL (Time To Live) using a distributed system like Redis (opens in a new tab). Redis is an in-memory data store that supports distributed locks and is well-suited for high-concurrency environments. It offers high availability and can be used to implement a distributed lock mechanism for the ticket booking process. Here is how it would work:
-
When a user selects a ticket, acquire a lock in Redis using a unique identifier (e.g., ticket ID) with a predefined TTL (Time To Live). This TTL acts as an automatic expiration time for the lock.
-
If the user completes the purchase, the ticket's status in the database is updated to "Booked", and the lock in Redis is manually released by the application after the TTL.
-
If the TTL expires (indicating the user did not complete the purchase in time), Redis automatically releases the lock. This ensures that the ticket becomes available for booking by other users without any additional intervention.
Now our Ticket table only has two states: available and booked. Locking of reserved tickets is handled entirely by Redis. The key-value pair in Redis is the ticket ID and the value is the user ID. This way we can ensure that when a user confirms the booking, they are the user who reserved the ticket.
Challenges
The main downside comes with handling failures. If our lock goes down for any reason, then we have a period of time where user experience is degraded. Note that we will still never have a "double booking" since our database will use OCC or any other concurrency control to ensure this. The downside is just that users can get an error after filling out their payment details if someone beats them to it. This sucks, but I would argue that it is a better outcome than having all tickets appear unavailable (as would be the case if the cron job in our previous solution failed).
Q. How to scale to support 10 million of concurrent requests during peak time?
scale的一般套路,cache,load balancer, horizontal scaling Great Solution: Caching, Load Balancing, and Horizontal Scaling
Approach
Caching
-
Prioritize caching for data with high read rates and low update frequency, such as event details (names, dates, venue information), performer bios, and static venue details like location and capacity. Because this data does not change frequently, we can cache it like crazy to heavily minimize the load of our SQL DB and meet our high availability requirements.
-
Cache (opens in a new tab) key-value pairs like eventId:eventObject to efficiently serve frequently accessed data.
-
Utilize Redis or Memcached as in-memory data stores, leveraging their speed for handling large volumes of read operations. A read-through cache strategy ensures data availability, with cache misses triggering a database read and subsequent cache update.
-
Cache Invalidation and Consistency:
-
Set up database triggers to notify the caching system of data changes, such as updates in event dates or performer lineups, to invalidate relevant cache entries.
-
Implement a Time-to-Live policy for cache entries, ensuring periodic refreshes. These TTLs can be long for static data like venue information and short for frequently updated data like event availability.
-
Load Balancing
- Use algorithms like Round Robin or Least Connections for even traffic distribution across server instances. Implement load balancing for all horizontally scaled services and databases. You typically don't need to show this in your design, but it's good to mention it.
Horizontal Scaling
- The Event Service is stateless which allows us to horizontally scale it to meet demand. We can do this by adding more instances of the service and load balancing between them.
Challenges
-
One of the primary challenges is maintaining consistency between the cache and the database. This is particularly challenging with frequent updates to event details (but we don't expect this)
-
Managing a large number of instances presents complexities. Ensuring smooth deployment and effective rollback procedures adds to the operational challenges.
Q: How will the system ensure a good user experience during high-demand events with millions simultaneously booking tickets?
With popular events, the loaded seat map will go stale quickly. Users will grow frustrated as they repeatedly click on a seat, only to find out it has already been booked. We need to ensure that the seat map is always up to date and that users are notified of changes in real-time.
Sometimes the best solution is actually not technically more challenging. The mark of a senior/staff engineer is their ability to solve business problems and sometimes that means thinking outside the box of presumed constraints. The below Good and Great solutions are illustrative of a common delta between senior and staff candidates.
Good Solution: SSE for Real-Time Seat Updates
Approach
To ensure that the seat map is always up to date, we can use Server-Sent Events (SSE) (opens in a new tab) to push updates to the client in real-time. This will allow us to update the seat map as soon as a seat is booked (or reserved) by another user without needing to refresh the page. SSE is a unidirectional communication channel between the server and the client. It allows the server to push data to the client without the client having to request it.
Challenges
While this approach works well for moderately popular events, the user experience will still suffer during extremely popular events. In the "Taylor Swift case," for example, the seat map will immediately fill up, and users will be left with a disorienting and overwhelming experience as available seats disappear in an instant.
Great Solution: Virtual Waiting Queue for Extremely Popular Events
Approach
For extremely popular events, we can implement an admin enabled virtual waiting queue system to manage user access during times of exceptionally high demand. Users are placed in this queue before even being able to see the booking page (seat map selected). It is designed to control the flow of users accessing the booking interface, thereby preventing system overload and enhancing the user experience. Here is how it would work at a high level:
-
When a user requests to view the booking page, they are placed in a virtual queue. We establish a WebSocket connection with their client and add the user to the queue using their unique WebSocket connection.
-
Periodically or based on certain criteria (like tickets booked), dequeue users from the front of the queue. Notify these users via their WebSocket connection that they can proceed to purchase tickets.
-
At the same time, update the database to reflect that this user is now allowed to access the ticket purchasing system.
Challenges
Long wait times in the queue might lead to user frustration, especially if the estimated wait times are not accurate or if the queue moves slower than expected. By pushing updates to the client in real-time, we can mitigate this risk by providing users with constant feedback on their queue position and estimated wait time.
Virtual Waiting Queue
Q: How can you improve search to ensure we meet our low latency requirements?
Our current search implementation is not going to cut it. Queries to search for events based on keywords in the name, description, or other fields will require a full table scan because of the wildcard in the LIKE clause. This can be very slow, especially as the number of events grows.
-- slow query SELECT * FROM Events WHERE name LIKE '%Taylor%' OR description LIKE '%Taylor%'
Let's look at some strategies to improve search performance and ensure we meet our low latency requirements.
Good Solution: Indexing & SQL Query Optimization
Great Solution: Full-text Indexes in the DB
Great Solution: Use a Full-text Search Engine like Elasticsearch
Elasticsearch
Q: How can you speed up frequently repeated search queries and reduce load on our search infrastructure?
Good Solution: Implement Caching Strategies Using Redis or Memcached
Great Solution: Great Answer: Implement Query Result Caching and Edge Caching Techniques
As you progress through the deep dives, you should be updating your design to reflect the changes you are making. After doing so, you could have a final design like looks something like this:
Final Design
Visual communication is important! Your interviewer is busy. They are likely going to wrap up the interview, go into a long day of meetings, go home tired, and then come back the next morning to remember that they need to write feedback for the interview they conducted the day before. They're then going to pull up your design and try to remember what you said. Make their life easier and improve your own chances by making your visual design as clear as possible.