Q: Functional & non-functional requirements A: Core Requirements
- Users should be able to upload a file from any device
- Users should be able to download a file from any device
- Users should be able to share a file with other users and view the files shared with them
Core Requirements
- The system should be highly available (prioritizing availability over consistency).
- The system should support files as large as 50GB.
- The system should be secure and reliable. We should be able to recover files if they are lost or corrupted.
- The system should make upload and download times as fast as possible (low latency).
Defining the Core Entities
I like to start with a broad overview of the primary entities. At this stage, it is not necessary to know every specific column or detail. We will focus on these intricacies later when we have a clearer grasp of the system (during the high-level design). Initially, establishing these key entities will guide our thought process and lay a solid foundation as we progress towards defining the API.
For Dropbox, the primary entities are incredibly straightforward:
- File: This is the raw data that users will be uploading, downloading, and sharing.
- FileMetadata: This is the metadata associated with the file. It will include information like the file's name, size, mime type, and the user who uploaded it.
- User: The user of our system.
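If it helps to make these concrete, a rough TypeScript sketch of the entities might look like the following (field names are illustrative and will be refined during the high-level design):

```typescript
// Illustrative shapes only -- fields will be refined during the high-level design.
interface User {
  id: string;
  email: string;
}

interface FileMetadata {
  id: string;
  name: string;
  size: number;       // in bytes
  mimeType: string;
  uploadedBy: string; // User.id
}

// The file itself is just raw bytes; in a browser client this might be a Blob.
type FileContents = Blob;
```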
Q: API or System Interface
The API is the primary interface that users will interact with. It's important to define the API early on, as it will guide your high-level design. We just need to define an endpoint for each of our functional requirements.
Starting with uploading a file, we might have an endpoint like this:
POST /files Request: { File, FileMetadata }
To download a file, our endpoint can be:
GET /files/{fileId} -> File & FileMetadata
Be aware that your APIs may change or evolve as you progress. In this case, our upload and download APIs actually evolve significantly as we weigh the trade-offs of various approaches in our high-level design (more on this later). You can proactively communicate this to your interviewer by saying, "I am going to outline some simple APIs, but may come back and improve them as we delve deeper into the design."
Lastly, to share a file, we might have an endpoint like this:
POST /files/{fileId}/share Request: { User[] // The users to share the file with }
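As a rough sketch only (Express and these handler stubs are assumptions of this example, not part of the design), the API surface could be wired up like this:

```typescript
import express from "express";

const app = express();
app.use(express.json());

// Upload a file and its metadata (this endpoint evolves into a presigned-URL flow later).
app.post("/files", (req, res) => {
  // ...persist file contents and metadata...
  res.status(201).send();
});

// Download a file along with its metadata.
app.get("/files/:fileId", (req, res) => {
  // ...look up metadata and return the file...
  res.send();
});

// Share a file with a list of users.
app.post("/files/:fileId/share", (req, res) => {
  // ...record a share for each user in req.body.users...
  res.status(204).send();
});

app.listen(3000);
```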
Q: Users should be able to upload a file from any device
The main requirement for a system like Dropbox is to allow users to upload files. When it comes to storing a file, we need to consider two things:
- Where do we store the file contents (the raw bytes)?
- Where do we store the file metadata?
For the metadata, we can use a NoSQL database like DynamoDB. DynamoDB is a fully managed NoSQL database hosted by AWS. Our metadata is loosely structured, with few relations and the main query pattern being to fetch files by user. This makes DynamoDB a solid choice, but don't get too caught up in making the right choice here in your interview. The reality is a SQL database like PostgreSQL would work just as well for this use case. Learn more about how to choose the right database (and why it may not matter) here.
Our schema will be a simple document and can start with something like this:
{ "id": "123", "name": "file.txt", "size": 1000, "mimeType": "text/plain", "uploadedBy": "user1" }
As for how we store the file itself, we have a few options. Let's take a look at the trade-offs of each.
Bad Solution: Upload File to a Single Server
Good Solution: Store File in Blob Storage
Approach
A better approach is to store the file in a Blob Storage service like Amazon S3 or Google Cloud Storage. When a user uploads a file to our backend, we can send the file directly to Blob Storage and store the metadata in our database. We can store a (virtually) unlimited number of files in Blob Storage as it will handle the scaling for us. It's also more reliable. If our server goes down, we don't lose access to our files. We can also take advantage of Blob Storage features like lifecycle policies to automatically delete old files and versioning to keep track of file changes if needed (though this is out of scope for this problem).
Challenges
One challenge with this approach is that it's more complex. We need to integrate with the Blob Storage service and handle the case where the file is uploaded but the metadata is not saved. We also need to handle the case where the metadata is saved but the file is not uploaded. We can solve these issues by using a transactional approach where we only save the metadata if the file is successfully uploaded and vice versa.
Second, this approach (as depicted above) requires that we technically upload a file twice -- once to our backend and once to Blob Storage. This is redundant. We can solve this issue by allowing the user to upload the file directly to the Blob Storage service.
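A minimal sketch of that ordering on the backend, assuming S3 and the hypothetical saveFileMetadata helper from the metadata sketch above (the bucket name is also a placeholder):

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Upload the bytes first; only persist metadata once the upload succeeds.
// saveFileMetadata is the hypothetical helper from the metadata sketch above,
// and the bucket name is a placeholder.
async function uploadThroughBackend(
  file: Buffer,
  metadata: { id: string; name: string; size: number; mimeType: string; uploadedBy: string }
): Promise<void> {
  await s3.send(new PutObjectCommand({
    Bucket: "file-contents-bucket",
    Key: metadata.id,
    Body: file,
    ContentType: metadata.mimeType,
  }));
  // If this write fails, the uploaded object is orphaned and can be cleaned up asynchronously.
  await saveFileMetadata(metadata);
}
```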
Great Solution: Upload File Directly to Blob Storage
Approach
The best approach is to allow the user to upload the file directly to Blob Storage from the client. This is faster and cheaper than uploading the file to our backend first. We can use presigned URLs to generate a URL that the user can use to upload the file directly to the Blob Storage service. Once the file is uploaded, the Blob Storage service will send a notification to our backend so we can save the metadata.
Presigned URLs are URLs that give the user permission to upload a file to a specific location in the Blob Storage service. We can generate a presigned URL and send it to the user when they want to upload a file. So whereas our initial API for upload was a POST to /files, it will now be a three-step process:
- Request a pre-signed URL from our backend (which itself gets the URL from the Blob Storage service like S3) and save the file metadata in our database with a status of "uploading."
POST /files/presigned-url -> PresignedUrl Request: { FileMetadata }
- Use the presigned URL to upload the file to Blob Storage directly from the client. This is via a PUT request directly to the presigned URL where the file is the body of the request.
- Once the file is uploaded, the Blob Storage service will send a notification to our backend using S3 Notifications. Our backend will then update the file metadata in our database with a status of "uploaded".
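To make the flow concrete, here is a minimal sketch of the backend pieces, assuming S3, the AWS SDK v3 presigner, and the hypothetical "FileMetadata" table from earlier (names and shapes are illustrative, not prescriptive):

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand, UpdateCommand } from "@aws-sdk/lib-dynamodb";
import type { S3Event } from "aws-lambda";

const s3 = new S3Client({});
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const BUCKET = "file-contents-bucket"; // placeholder bucket name

// Step 1: record the metadata as "uploading" and hand the client a presigned PUT URL.
async function createPresignedUpload(metadata: {
  id: string;
  name: string;
  size: number;
  mimeType: string;
  uploadedBy: string;
}): Promise<string> {
  await ddb.send(new PutCommand({
    TableName: "FileMetadata",
    Item: { ...metadata, status: "uploading" },
  }));
  return getSignedUrl(
    s3,
    new PutObjectCommand({ Bucket: BUCKET, Key: metadata.id, ContentType: metadata.mimeType }),
    { expiresIn: 3600 } // URL valid for one hour
  );
}

// Step 3: S3 notifies our backend (here modeled as a Lambda handler) once the object lands;
// we then flip the status to "uploaded".
export async function onObjectUploaded(event: S3Event): Promise<void> {
  for (const record of event.Records) {
    const fileId = record.s3.object.key;
    await ddb.send(new UpdateCommand({
      TableName: "FileMetadata",
      Key: { id: fileId },
      UpdateExpression: "SET #s = :uploaded",
      ExpressionAttributeNames: { "#s": "status" },
      ExpressionAttributeValues: { ":uploaded": "uploaded" },
    }));
  }
}
```

Step 2 is then just the client issuing a plain PUT of the file bytes to the returned URL.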
What are Presigned URLs?
Presigned URLs are a feature provided by cloud storage services, such as Amazon S3, that allow temporary access to private resources. These URLs are generated with a specific expiration time, after which they become invalid, offering a secure way to share files without altering permissions. When a presigned URL is created, it includes authentication information as part of the query string, enabling controlled access to otherwise private objects.
This makes them ideal for use cases like temporary file sharing, uploading objects to a bucket without giving users full API access, or providing limited-time access to resources. Presigned URLs can be generated programmatically using the cloud provider's SDK, allowing developers to integrate this functionality into applications seamlessly. This method enhances security by ensuring that sensitive data remains protected while still being accessible to authorized users for a limited period.
Q: Users should be able to download a file from any device
The next step is making sure users can download their saved files. Just like with uploads, there are a few different ways to approach this.
Bad Solution: Download through File Server
Approach
The most common solution candidates come up with is to download the file once from Blob Storage to our backend server and then once more from our backend to the client.
Challenges
Of course, the solution is suboptimal as we end up downloading the file twice, which is both slow and expensive. We can solve this issue by allowing the user to download the file directly from the Blob Storage service, just like we did with the upload.
Good Solution: Download from Blob Storage
Approach
A better approach is to allow the user to download the file directly from Blob Storage. We can use presigned URLs to generate a URL that the user can use to download the file directly from Blob Storage. Like with uploading, the presigned URL will give the user permission to download the file from a specific location in the Blob Storage service for a limited time.
- Request a presigned download URL from our backend
GET /files/{fileId}/presigned-url -> PresignedUrl
- Use the presigned URL to download the file from the Blob Storage service directly to the client.
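Mirroring the upload sketch, and under the same assumptions about bucket naming, the download side might look like this (the ownership/share-list permission check is deliberately elided):

```typescript
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({});

// Return a short-lived URL the client can GET directly from Blob Storage.
// A real handler would first verify that the requesting user owns the file
// or appears in its share list.
async function createPresignedDownload(fileId: string): Promise<string> {
  return getSignedUrl(
    s3,
    new GetObjectCommand({ Bucket: "file-contents-bucket", Key: fileId }), // placeholder bucket name
    { expiresIn: 3600 }
  );
}
```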
Challenges
While nearly optimal, the main limitation is that this can still be slow for a large, geographically distributed user base. Your Blob Storage is located in a single region, so users far away from that region will have slower download times. We can solve this issue by using a content delivery network (CDN) to cache the file closer to the user.
Great Solution: Download from CDN
Approach
The best approach is to use a content delivery network (CDN) to cache the file closer to the user. A CDN is a network of servers distributed across the globe that cache files and serve them to users from the server closest to them. This reduces latency and speeds up download times.
When a user requests a file, we can use the CDN to serve the file from the server closest to the user. This is much faster than serving the file from our backend or the Blob Storage service.
For security, just like with our S3 presigned URLs, we can generate a URL that the user can use to download the file from the CDN. This URL will give the user permission to download the file from a specific location in the CDN for a limited time. More on this in our deep dives on security.
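If the CDN happened to be CloudFront (an assumption of this sketch, not a requirement of the design), a time-limited URL could be generated with the SDK's CloudFront signer; the domain, key pair ID, and private key below are placeholders:

```typescript
import { getSignedUrl } from "@aws-sdk/cloudfront-signer";

// Grant time-limited access to the cached object. The CDN domain, key pair ID,
// and private key are placeholders for whatever your distribution actually uses.
function createCdnDownloadUrl(fileId: string, privateKey: string): string {
  return getSignedUrl({
    url: `https://cdn.example.com/${fileId}`,
    keyPairId: "KEY_PAIR_ID_PLACEHOLDER",
    privateKey,
    dateLessThan: new Date(Date.now() + 60 * 60 * 1000).toISOString(), // valid for one hour
  });
}
```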
Challenges
CDNs are relatively expensive. To address this, it is common to be strategic about what files are cached and for how long. We can use a cache control header to specify how long the file should be cached in the CDN. We can also use a cache invalidation mechanism to remove files from the CDN when they are updated or deleted. This way, only files that are frequently accessed are cached and we don't waste money caching files that are rarely accessed.
Q: Users should be able to share a file with other users
To round out the functional requirements, we need to support sharing files with other users. We will implement this similarly to Google Drive, where you just need to enter the email address of the user you want to share the file with. We can assume users are already authenticated.
The main consideration here in an interview is how you can make this process fast and efficient. Let's break it down.
Bad Solution: Add a Sharelist to Metadata
Good Solution: Caching to speed up fetching the Sharelist
Great Solution: Create a separate table for shares
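As a rough sketch of the separate-table idea, assuming a DynamoDB table named "SharedFiles" with the sharee's user ID as the partition key and the file ID as the sort key (all names here are this example's assumptions):

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Record that a file has been shared with a user.
async function shareFile(fileId: string, userId: string): Promise<void> {
  await ddb.send(new PutCommand({
    TableName: "SharedFiles", // assumed table: partition key userId, sort key fileId
    Item: { userId, fileId, sharedAt: new Date().toISOString() },
  }));
}

// "Files shared with me" then becomes a single query on the partition key.
async function listFilesSharedWith(userId: string) {
  const result = await ddb.send(new QueryCommand({
    TableName: "SharedFiles",
    KeyConditionExpression: "userId = :u",
    ExpressionAttributeValues: { ":u": userId },
  }));
  return result.Items ?? [];
}
```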
Tying it all together
Let's take a step back and look at our system as a whole. At this point, we have a simple design that satisfies all of our functional requirements.
- Uploader: This is the client that uploads the file. It could be a web browser, a mobile app, or a desktop app.
- Downloader: This is the client that downloads the file. Of course, this can be the same client as the uploader, but it doesn't have to be. We separate them in our design for clarity.
- LB & API Gateway: This is the load balancer and API Gateway that sits in front of our application servers. It's responsible for routing requests to the appropriate server and handling things like SSL termination, rate limiting, and request validation.
- File Service: The file service is only responsible for writing to and from the file metadata DB as well as requesting presigned URLs from S3. It doesn't actually handle the file upload or download. It's just a middleman between the client and S3.
- File Metadata DB: This is where we store metadata about the files. This includes things like the file name, size, MIME type, and the user who uploaded the file. We also store a shared files table here that maps files to users who have access to them. We use this table to enforce permissions when a user tries to download a file.
- S3: This is where the files are actually stored. We upload and download files directly to and from S3 using the presigned URLs we get from the File Service.
- CDN: This is a content delivery network that caches files close to the user to reduce latency. We use the CDN to serve files to the downloader.