Sharding the data in a big data store is often not a trivial problem to solve. Sooner or later you will discover that the sharding schema you chose initially may not be the right one long term. Fortunately, we stumbled upon this quite early in our development and decided to redesign our data storage tier before we fill it with lots of data, which would make the reshuffling quite complex.

In our particular case, we started with Microsoft Azure’s DocumentDB, which limits its collection size to 250GB unless you explicitly ask Microsoft for more. DocumentDB provides automatic partitioning of the data but selecting the partition key can still be a challenge, hence you may find the exercise below useful.

The scenario we were trying to plan for was related to our personal finance application, which allows users to save receipt information. Briefly, the flow is as follows: a user captures a receipt with her phone, we convert the receipt to JSON and store the JSON in DocumentDB; users can be part of a group (a family, for example, can be a group), and the receipts are associated with the group. Here is a simple relationship model:

[Figure: zenxpense data model]

The expectation is that users will upload receipts every day, and that they will use the application (we hope 🙂) for years to come. We made the following estimates for our data growth:

  • We would like a sharding schema that can support our needs for the next 5 years; we don’t know what our user growth will be over those 5 years, but like every other startup we hope it will be something north of 10 million users 🙂
  • A single user or family saves about 50 receipts a week, which will result in approximately 2,500 receipts a year or 12,500 for 5 years
  • A single receipt requires about 10KB of storage. We can also store a summary of the receipt, which requires about 250 bytes of storage (but we will still need to store the full receipt somewhere)

Additionally, we don’t need to store the user and group documents in separate collections (i.e. we could put all three document types in the same collection), but we decided to do so in order to allow easier access to that data for authentication, authorization, and visualization purposes. With all that said, we are left to deal mostly with the receipts data, which will grow at a much faster pace. Based on the numbers above, a single collection can store 25M receipts or 1B summaries. Thus we started looking at different approaches to shard the data.
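As a quick sanity check, here is the per-collection capacity math as a small Python sketch (the sizes are the rough estimates from above, not measured values):

```python
# Rough capacity of a single 250GB collection, using the estimates above.
COLLECTION_LIMIT_BYTES = 250 * 10**9   # 250GB DocumentDB collection limit
FULL_RECEIPT_BYTES = 10 * 10**3        # ~10KB per full receipt
SUMMARY_BYTES = 250                    # ~250 bytes per receipt summary

receipts_per_collection = COLLECTION_LIMIT_BYTES // FULL_RECEIPT_BYTES   # 25,000,000
summaries_per_collection = COLLECTION_LIMIT_BYTES // SUMMARY_BYTES       # 1,000,000,000

print(f"{receipts_per_collection:,} receipts or {summaries_per_collection:,} summaries per collection")
```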

Using the assumptions above, you can easily come up with some staggering numbers. For example, for the first year alone we should plan for:

2M users * 2,500 receipts each * 10KB per receipt = 50TB of storage

Numbers like these may even call into question the choice of DocumentDB (or any other NoSQL database) as the data storage tier. Nevertheless, the point of this post is how to shard the data and what process we went through to do that.

In the explanations below I will use DocumentDB terminology, but you can easily translate the concepts to any other database or storage technology.

Sharding by tenant identifier

One obvious approach for sharding the data is to use the tenant identifier (user_id or group_id in our case) as the key. Thus we will have a single collection storing the mapping information and multiple collections storing the receipts for ranges of groups. As shown in the picture below, using the map collection we can look up, by group_id, the name of the collection where the receipts for this group are stored, and then query that collection to retrieve any receipt that belongs to the group.

[Figure: sharding by group_id]
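To make the lookup flow concrete, here is a minimal sketch of the routing logic in Python. The group ranges and collection names are made up for illustration, and the map is kept in memory; in practice the map itself would live in its own small DocumentDB collection:

```python
# Routing by tenant: the "map" collection translates a group_id into the
# collection that holds that group's receipts. Ranges and names are examples.
SHARD_MAP = [
    (1,    2000, "receipts-001"),
    (2001, 4000, "receipts-002"),
    (4001, 6000, "receipts-003"),
]

def collection_for_group(group_id: int) -> str:
    """Return the name of the collection holding this group's receipts."""
    for low, high, collection in SHARD_MAP:
        if low <= group_id <= high:
            return collection
    raise KeyError(f"no shard registered for group {group_id}")

# All receipts for a group live in exactly one collection, so a single
# query against that collection returns them.
print(collection_for_group(2742))  # -> receipts-002
```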

Using this approach, though, and taking into account our estimates, each collection will be able to support only 2,000 groups.

2,000 groups * 2,500 receipts/year * 5 years * 10KB = 250GB

Assuming linear user growth over the 5 years results in 2M users in the first year, which in the best case means 500K groups (4 users per family, for example) or 250 collections. The problem is that we will need to create a new collection for every 2,000 groups even though the previous one is less than 20% full (each collection is sized for 5 years of receipts, but only one year's worth has been written). That is a bit expensive, keeping in mind that we don’t know how our user base and the usage of our product will grow.
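The arithmetic behind those figures, as a small sketch using the same estimates:

```python
# How many groups fit in one collection sized for 5 years of receipts,
# and how many collections the first year's users would require.
COLLECTION_LIMIT_BYTES = 250 * 10**9
RECEIPT_BYTES = 10 * 10**3
RECEIPTS_PER_GROUP_PER_YEAR = 2_500
YEARS = 5

bytes_per_group = RECEIPTS_PER_GROUP_PER_YEAR * YEARS * RECEIPT_BYTES  # 125MB over 5 years
groups_per_collection = COLLECTION_LIMIT_BYTES // bytes_per_group      # 2,000

first_year_groups = 500_000
collections_needed = first_year_groups // groups_per_collection        # 250

# After year one each group has written only 1/5 of its 5-year budget,
# so a collection that already holds its 2,000 groups is only ~20% full.
print(groups_per_collection, collections_needed, f"{1 / YEARS:.0%}")
```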

A preferable approach would be to create a new collection only when the previous one becomes full.

Sharding by timestamp

Because the receipts are submitted over time, another feasible approach would be to shard the data by timestamp. We end up with a picture similar to the one above; however, instead of using group_id as the partition key, we use the timestamp – receipts with timestamps in a particular range are stored in a single partition.
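As a sketch, the routing could be as simple as deriving a collection name from the receipt's timestamp; monthly buckets and the naming scheme below are just one possible choice:

```python
from datetime import datetime

# Time-based routing: derive the target collection from the receipt's
# timestamp. Monthly buckets are just one possible range size.
def collection_for_timestamp(ts: datetime) -> str:
    return f"receipts-{ts.year}-{ts.month:02d}"

print(collection_for_timestamp(datetime(2016, 3, 14)))  # -> receipts-2016-03

# "Receipts from last month" hits a single collection, while "all receipts
# for group X" has to fan out across every collection ever created.
```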

In this case, pulling out all the receipts for a particular group becomes a problem, but since this is a very rare (though still possible) scenario, the trade-off may be warranted. Searching for a receipt by its properties would also be a challenge, because we would need to scan every collection. For everyday use, users will request the receipts from the last week or month, which results in a query against a single collection.

The good side of this approach is that we will need to create a new collection only when the previous one fills up, which means we will not be paying for unused space.

Multi-tier sharding

The previous two approaches assume that there is a single tier for sharding the data. Another approach would be to have two (or more) tiers. In our case, this would look something like this:

[Figure: multi-tier sharding]

Using this approach, we store the receipt summaries in the first shard tier, which allows us to track many more receipts in a small number of collections. We can search by group_id to identify the receipts we need and then pull the full receipt only if the user requests it. If we run the numbers for the first year's users (sized for 5 years of their receipts), it looks something like this:

2M users -> 500K groups -> 6.25B receipts -> 250 full-receipt collections + 7 intermediate (summary) collections

However, we can now support 80,000 groups with a single intermediate collection (instead of 2,000 as in the previous case), and we fill both the summary and the full-receipt collections before a new one is created. Also, the number of collections grows much more slowly even if our user base grows fast.
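Here is a rough sketch of the two-step read path this implies: query the summary tier by group_id, then fetch the full receipt on demand. query_collection() is a stand-in for a real DocumentDB query, and the document fields are assumptions for illustration:

```python
# Two-tier read path: summaries (tier 1) are small and searchable by
# group_id; each summary points at the collection holding the full receipt
# (tier 2), which is fetched only on demand.
def query_collection(collection: str, **filters):
    """Placeholder for a query against a single DocumentDB collection."""
    raise NotImplementedError

def summaries_for_group(group_id: int, summary_collection: str):
    # Tier 1: ~250-byte summaries, ~80,000 groups per collection.
    return query_collection(summary_collection, group_id=group_id)

def full_receipt(summary: dict):
    # Tier 2: follow the pointer stored in the summary to the ~10KB document.
    return query_collection(summary["receipt_collection"],
                            receipt_id=summary["receipt_id"])
```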

The multi-tier sharding approach can also be done using the timestamps or the receipt identifiers as keys for the intermediate collection.

Sharding by receipt identifier

Sharding by receipt_id is obviously the simplest way to shard the data; however, it may not be feasible in a scenario like ours, because receipts are retrieved based on group_id, and this would mean querying every collection to retrieve all the receipts (or to find a particular receipt) belonging to a group. Well, that is the case only if the NoSQL provider does not offer automatic partitioning functionality; because DocumentDB does, our problem turned out to be a non-problem 🙂 Nevertheless, you need to consider all the implications when choosing the partition key.
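For contrast, this is roughly what retrieving a group's receipts looks like when the data is sharded by receipt_id and the database does not route the query for us: every collection has to be scanned. The collection names and the query helper below are made up for illustration:

```python
# Fan-out when sharding by receipt_id: without a group_id-to-collection
# index (or automatic partitioning), every collection must be queried to
# assemble one group's receipts.
RECEIPT_COLLECTIONS = ["receipts-001", "receipts-002", "receipts-003"]

def query_collection(collection: str, **filters) -> list:
    """Placeholder for a query against a single DocumentDB collection."""
    return []

def receipts_for_group_fanout(group_id: int) -> list:
    results = []
    for collection in RECEIPT_COLLECTIONS:
        # One query per shard; cost grows linearly with the number of shards.
        results.extend(query_collection(collection, group_id=group_id))
    return results
```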

As I mentioned above, we started with DocumentDB as our choice for storing the data, but after running the numbers we may reconsider. DocumentDB is a great fit for storing JSON data and offers excellent partitioning and querying features; however, looking at our projections, the cost of using it may turn out to be quite high.