Sharding the data in your big data storage is often not a trivial problem to solve. Sooner or later you will discover that the sharding schema you chose initially may not be the right one long term. Fortunately, we stumbled upon this quite early in our development and decided to redesign our data storage tier before we filled it with lots of data, which would have made the reshuffling quite complex.

In our particular case, we started with Microsoft Azure’s DocumentDB, which limits its collection size to 250GB unless you explicitly ask Microsoft for more. DocumentDB provides automatic partitioning of the data but selecting the partition key can still be a challenge, hence you may find the exercise below useful.

The scenario we were trying to plan for was related to our personal finance application, which allows users to save receipt information. Briefly, the flow is as follows: the user captures a receipt with her phone, we convert the receipt to JSON and store the JSON in DocumentDB; users can be part of a group (for example, a family can be a group), and the receipts are associated with the group. Here is a simple relationship model:

[Figure: zenxpense data model]

The expectation is that users will upload receipts every day and will use the application (we hope 🙂) for years to come. We made the following estimates for our data growth:

  • We would like a sharding schema that can support our needs for the next 5 years; we don’t know what our user growth will be over that period, but like every other startup we hope it will be north of 10 million users 🙂
  • A single user or family saves about 50 receipts a week, which results in approximately 2,500 receipts a year, or 12,500 over 5 years
  • A single receipt requires about 10KB of storage. We can also store a summary of the receipt, which requires only about 250 bytes (but we still need to store the full receipt somewhere)

Additionally, we don’t need to store the user and group documents in separate collections (i.e. we could put all three document types in the same collection), but we decided to do so to allow easier access to that data for authentication, authorization, and visualization purposes. With all that said, we are left to deal mostly with the receipts data, which will grow at a much faster pace. Based on the numbers above, a single collection can store 25M full receipts or 1B summaries. Thus we started looking at different approaches to shard the data.
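
If you want to double-check the per-collection capacity, the arithmetic is a few lines of Python (decimal units, to match the rough estimates above):

```python
# Rough capacity of a single 250GB DocumentDB collection.
COLLECTION_LIMIT = 250 * 10**9   # 250GB collection limit
FULL_RECEIPT = 10 * 10**3        # ~10KB per full receipt
SUMMARY = 250                    # ~250 bytes per receipt summary

print(COLLECTION_LIMIT // FULL_RECEIPT)  # 25,000,000 full receipts
print(COLLECTION_LIMIT // SUMMARY)       # 1,000,000,000 summaries
```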

Using the assumptions above, you can easily come up with some staggering numbers. For example, for the first year alone we should project:

2M users * 2,500 receipts/each * 10KB per receipt = 50TB of storage

This may even call into question the choice of DocumentDB (or any other NoSQL database) as the data storage tier. Nevertheless, the point of this post is how to shard the data and the process we went through to do that.

In the explanations below I will use DocumentDB terminology, but you can easily translate the concepts to any other database or storage technology.

Sharding by tenant identifier

One obvious approach to sharding the data is to use the tenant identifier (user_id or group_id in our case) as a key. Thus we have a single collection that stores the mapping information and multiple collections that store the receipts for ranges of groups. As shown in the picture below, based on group_id we can look up the name of the collection where the receipts for this group are stored using the map collection, and then query the resulting collection to retrieve any receipt that belongs to the group.

[Figure: sharding by group_id]
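
A minimal sketch of that two-step lookup with the pydocumentdb SDK might look as follows; the account URL, database and collection names, and the shape of the map documents are all assumptions for illustration:

```python
import pydocumentdb.document_client as document_client

# Account URL and key are placeholders.
client = document_client.DocumentClient(
    'https://<account>.documents.azure.com:443/', {'masterKey': '<key>'})

def receipts_for_group(group_id):
    query = {'query': 'SELECT * FROM c WHERE c.group_id = @gid',
             'parameters': [{'name': '@gid', 'value': group_id}]}

    # Step 1: the map collection tells us which collection holds the group.
    mapping = list(client.QueryDocuments('dbs/zenxpense/colls/shard_map',
                                         query))[0]

    # Step 2: query that collection for the group's receipts.
    return list(client.QueryDocuments(
        'dbs/zenxpense/colls/' + mapping['receipts_collection'], query))
```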

Using this approach, though, and taking into account our estimates, each collection will be able to support only 2,000 groups.

2,000 groups * 2,500 receipts/year * 5 years * 10KB = 250GB

Assuming linear growth to 10 million users over 5 years gives us 2M users in the first year, which in the best case means 500K groups (at 4 users per family, for example), or 250 collections. The problem is that we need to create a new collection for every 2,000 groups even though the previous one is less than 20% full: after the first year, each group has produced only 2,500 of the 12,500 receipts the collection is sized for. That is a bit expensive, given that we don't know how our user base and the use of our product will grow.

A preferred approach would be to create a new collection only when the previous one becomes full.

Sharding by timestamp

Because the receipts are submitted over time, another feasible approach is to shard the data by timestamp. We end up with a picture similar to the one above, except that instead of group_id we use the timestamp as the partition key: receipts with timestamps in a particular range are stored in a single collection.

In this case, we would have trouble pulling out all the receipts for a particular group, but considering that this is a very rare (though still possible) scenario, the trade-off may be warranted. Searching for a receipt by its properties would also be a challenge, because we would need to scan every collection. For everyday use, though, users will request the receipts from the last week or month, which results in a query to a single collection.

The good side of this approach is that we need to create a new collection only when the previous one fills up, which means we will not be paying for unused space.
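
A sketch of the routing logic under this scheme, assuming we record each collection's starting timestamp in a small map (kept here as an in-memory list for brevity; in practice it would live in its own small collection):

```python
import bisect

# Each entry is (start_timestamp, collection_name), sorted by start_timestamp.
# A new entry is appended when the current collection approaches the size limit.
SHARDS = [
    (0,          'receipts_2016_h1'),
    (1467331200, 'receipts_2016_h2'),  # opened when the previous shard filled up
]

def collection_for(timestamp):
    # The last shard whose start_timestamp <= timestamp owns this receipt.
    starts = [s for s, _ in SHARDS]
    return SHARDS[bisect.bisect_right(starts, timestamp) - 1][1]

def collections_for_range(start, end):
    # All shards whose time span overlaps [start, end]; a "last month" query
    # usually touches one collection, occasionally two at a shard boundary.
    starts = [s for s, _ in SHARDS]
    lo = max(bisect.bisect_right(starts, start) - 1, 0)
    hi = bisect.bisect_right(starts, end)
    return [name for _, name in SHARDS[lo:hi]]
```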

Multi-tier sharding

The previous two approaches assume that there is a single tier for sharding the data. Another approach would be to have two (or more) tiers. In our case, this would look something like this:

[Figure: multi-tier sharding]

Using this approach, we store the receipt summaries in the first shard tier, which allows us to track many more receipts in a smaller number of collections. We can search by group_id to identify the receipts we need and then pull the full receipt from the second tier if the user requests it. If we run the numbers for our first-year user base over the full 5 years, it looks something like this:

2M users -> 500K groups -> 6.25B receipts -> 250 full-receipt collections + 7 intermediate (summary) collections

However, a single intermediate collection can now support 80,000 groups instead of 2,000 as in the previous case (it holds 1B summaries, and each group produces 12,500 receipts over 5 years, so 1B / 12,500 = 80,000 groups), and we fill up both the summary and the full-receipt collections before a new one is created. The number of collections will also grow much more slowly if our user base grows fast.
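
The read path for this scheme could look something like the sketch below, again with pydocumentdb; the collection names, and the assumption that each summary carries a pointer (full_collection, receipt_id) to its full receipt, are illustrative:

```python
def receipt_summaries(client, group_id):
    query = {'query': 'SELECT * FROM s WHERE s.group_id = @gid',
             'parameters': [{'name': '@gid', 'value': group_id}]}
    # Tier 1: small summary documents, many groups per collection.
    return list(client.QueryDocuments('dbs/zenxpense/colls/summaries_1', query))

def full_receipt(client, summary):
    # Tier 2: fetch the full ~10KB document only when the user asks for it.
    # We assume each summary stores the collection and id of its full receipt.
    link = 'dbs/zenxpense/colls/%s/docs/%s' % (summary['full_collection'],
                                               summary['receipt_id'])
    return client.ReadDocument(link)
```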

The multi-tier sharding approach can also be done using the timestamps or the receipt identifiers as keys for the intermediate collection.

Sharding by receipt identifier

Sharding by receipt_id is obviously the simplest way to shard the data. However, it may not be feasible in a scenario like ours, because receipts are retrieved based on group_id, and this scheme would require querying every collection to retrieve all the receipts, or to find a particular receipt, belonging to a group. That is, unless the NoSQL provider offers automatic partitioning functionality; because DocumentDB does, our problem turned out to be no problem at all 🙂 Nevertheless, you need to consider all the implications when choosing the partition key.
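
For a provider without automatic partitioning, the usual substitute is deterministic hash routing; a minimal sketch (the fixed collection count is an assumption) that also shows why per-group queries fan out:

```python
import hashlib

NUM_COLLECTIONS = 250  # assumed fixed up front; changing it later means resharding

def collection_for_receipt(receipt_id):
    # Stable hash so the same receipt_id always routes to the same collection.
    digest = hashlib.md5(receipt_id.encode('utf-8')).hexdigest()
    return 'receipts_%03d' % (int(digest, 16) % NUM_COLLECTIONS)

def collections_for_group(group_id):
    # The downside: group_id carries no routing information,
    # so a per-group query has to fan out to every collection.
    return ['receipts_%03d' % i for i in range(NUM_COLLECTIONS)]
```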

As I mentioned above, we started with DocumentDB as our choice for storing the data, but after running the numbers we may reconsider. DocumentDB is a great choice for storing JSON data and offers amazing features for partitioning and querying it; however, looking at our projections, the cost of using it may turn out to be quite high.

With our first full-time developer on board, I had to put some structure around the tools and services we use to manage our work. In general I don't want to be too prescriptive about which tools people should use to get the job done, but it is good to set some guidelines for the toolset and outline the mandatory and optional parts. For our development we've made the following choices:

  • Microsoft Azure as Cloud Provider
  • TornadoWeb and Python 2.7 as a runtime for our APIs and frontend
  • DocumentDB and Azure storage for our storage tier
  • Azure Machine Learning and Microsoft Cognitive Services for machine learning

Well, those are the mandatory choices, but as I mentioned in my previous post How to Build a Great Software Development Team?, software development is more than just technology. Nevertheless, we had to decide on a toolset to at least start with, so here is the list:

1. Slack

My first impression of Slack was lukewarm, and I preferred the more conservative UI of HipChat. However, compared to HipChat, Slack offered multi-team capability right from the beginning, which allowed me to communicate not only with my team but also at client sites and with the advisory team for North Seattle College. In addition, HipChat introduced quite a few bugs in its latest versions, which made team communication unreliable and unproductive, and this swayed the decision to go with Slack. After some time I got used to Slack's UI and started liking it, and now it is an integral part of our team's communication.

2. Outlook 2016

For my personal email I use Google Apps with a custom domain, but I've been a long-time Outlook user, and with the introduction of Office 365 I think the value for the money is in Microsoft's favor. Managing multiple email accounts and calendars and scheduling in-person or online meetings using the GoToMeeting and Skype for Business plugins is a snap with Outlook. With the added benefit of Word, Excel and PowerPoint as part of the subscription, Office 365 is a no-brainer. We use Office 365 E3, which gives each of us the full set of Office capabilities.

3. Dropbox

Sending files via email is an archaic approach, although I still see it done widely. For that purpose we have set up Dropbox for the team. I have created shared folders for the leadership team as well as for each of the team members, allowing them to easily share files with each other. We settled on Dropbox Pro for the leadership team and the free Dropbox tier for the other team members. In the future we are considering a move to the Business edition.

4. Komodo Edit

I have been a long-time fan of Komodo. It is a very lightweight IDE that offers highlighting and type-assist for a number of programming languages like Python, HTML5, JavaScript and CSS3. It also allows you to extend its functionality with third-party plugins offering rich capabilities. I use it for most of my development.

5. Visual Studio Code

Visual Studio Code is the new cross-platform IDE from Microsoft. It is a lightweight IDE similar to Sublime Text and offers a lot of nice features that can be very helpful if you develop for Azure. It has built-in debugging and IntelliSense, and a plugin extensibility model with a growing number of plugin vendors. It is a great tool for creating Markdown documents, debugging with breakpoints from within the IDE, and more. Visual Studio Code is an alternative to Visual Studio that allows you to develop for Azure on platforms other than Windows. If you are a Visual Studio fan but don't want to pay a hefty amount of money, you can give Visual Studio Community Edition a try (unfortunately available for Windows only). Here is a Visual Studio editions comparison chart that you may find useful.

6. Visual Studio Online

Managing the development project is crucial for the success of your team. The functionality Visual Studio Online offers for keeping backlogs, tracking sprint work items and reporting is comparable to, if not better than, Jira's, and if you are bound to the Microsoft ecosystem it is the obvious choice. For our small team we rely almost entirely on the free edition, and it gives us all the functionality we need to manage the work.

7. Docker

Being able to deploy a complete development environment with the click of a button is crucial for development productivity. Creating a Docker Compose template consisting of two TornadoWeb workers and an NGINX load balancer in front (a configuration very similar to what we plan to use in production) takes less than an hour with Docker and greatly reduces the operational overhead for developers. Not only that, it also closely mimics the production configuration, which means the probability of introducing bugs caused by environment differences is practically zero.

With the introduction of Docker for Windows all the above became much easier to do on Windows Desktop, which is an added benefit.
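
Here is a minimal sketch of such a Compose template, not our actual one: it assumes a TornadoWeb app built from ./app that accepts a --port option, and an nginx.conf with an upstream block over worker1 and worker2 (all names and paths are illustrative):

```yaml
version: '2'
services:
  worker1:
    build: ./app                     # TornadoWeb app; Dockerfile assumed in ./app
    command: python app.py --port=8888
  worker2:
    build: ./app
    command: python app.py --port=8888
  nginx:
    image: nginx:latest
    ports:
      - "80:80"                      # single public entry point
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro   # upstream worker1/worker2
    depends_on:
      - worker1
      - worker2
```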

8. Draw.IO

Last but not least, being able to visually communicate your application or system design is essential for successful development projects. For that purpose we use Draw.IO. In addition to standard block diagrams and flowcharts, it offers Azure- and AWS-specific diagrams, creation of UI mockups, and even UML if you want to go that far.

Armed with the above set of tools you are well prepared to move fast with your development project on a lean budget.