Sharding the data in your big data store is often not a trivial problem to solve. Sooner or later you will discover that the sharding schema you chose initially may not be the right one long term. Fortunately, we stumbled upon this quite early in our development and decided to redesign our data storage tier before filling it with lots of data, which would make the reshuffling quite complex.

In our particular case, we started with Microsoft Azure's DocumentDB, which limits its collection size to 250GB unless you explicitly ask Microsoft for more. DocumentDB provides automatic partitioning of the data, but selecting the partition key can still be a challenge, hence you may find the exercise below useful.

The scenario we were trying to plan for was related to our personal finance application, which allows users to save receipt information. Briefly, the flow is as follows: a user captures a receipt with her phone, we convert the receipt to JSON and store the JSON in DocumentDB; users can be part of a group (for example, a family can be a group), and the receipts are associated with the group. Here is the simple relationship model:

[Figure: zenxpense data model]

The expectation is that users will upload receipts every day and will use the application (we hope 🙂) for years to come. We made the following estimates for our data growth:

  • We would like a sharding schema that can support our needs for the next 5 years; we don't know what our user growth will be over that period, but like every other startup we hope it will be something north of 10 million users 🙂
  • A single user or family saves about 50 receipts a week, which results in approximately 2,500 receipts a year, or 12,500 over 5 years
  • A single receipt requires about 10KB of storage. We can also store a summary of the receipt, which requires about 250 bytes (but we will still need to store the full receipt somewhere)

Additionally, we don't need to store the user and group documents in separate collections (i.e. we could put all three document types in the same collection), but we decided to do so in order to allow easier access to that data for authentication, authorization, and visualization purposes. With all that said, we are left to deal mostly with the receipts data, which will grow at the fastest pace. Based on the numbers above, a single collection can store 25M receipts or 1B summaries. Thus we started looking at different approaches to shard the data.

Using the assumptions above, you can easily come up with some staggering numbers. For example, for the first year we should plan for:

2M users * 2,500 receipts each * 10KB per receipt = 50TB of storage
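For reference, here is a quick back-of-envelope check of these numbers; a sketch using round decimal units, which is how the figures in this post were derived:

```python
# Back-of-envelope check of the storage estimates (round decimal units).
KB, GB, TB = 10**3, 10**9, 10**12

COLLECTION_LIMIT = 250 * GB   # DocumentDB collection size limit
RECEIPT_SIZE = 10 * KB        # full receipt JSON
SUMMARY_SIZE = 250            # receipt summary, in bytes
RECEIPTS_PER_YEAR = 2500      # ~50 receipts a week
YEARS = 5

# Capacity of a single 250GB collection
print(COLLECTION_LIMIT // RECEIPT_SIZE)   # 25,000,000 full receipts
print(COLLECTION_LIMIT // SUMMARY_SIZE)   # 1,000,000,000 summaries

# Projected storage for the first year (2M users)
print(2 * 10**6 * RECEIPTS_PER_YEAR * RECEIPT_SIZE // TB)  # 50 TB

# Groups a single collection can hold over 5 years (used below)
print(COLLECTION_LIMIT // (RECEIPTS_PER_YEAR * YEARS * RECEIPT_SIZE))  # 2,000
```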

Numbers like these may even call into question the choice of DocumentDB (or any other NoSQL database) as the data storage tier. Nevertheless, the point of this post is how to shard the data and what process we went through to do that.

In the explanations below I will use DocumentDB terminology, but you can easily translate the concepts to any other database or storage technology.

Sharding by tenant identifier

One obvious approach for sharding the data is to use the tenant identifier (user_id or group_id in our case) as a key. Thus we will have a single collection that stores the mapping information and multiple collections that store the receipts for a range of groups. As shown in the picture below, we use the map collection to look up, by group_id, the name of the collection where the receipts for this group are stored, and then query the resulting collection to retrieve any receipt that belongs to the group.

[Figure: sharding by group_id]
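To make this concrete, here is a minimal sketch of the lookup; the query(collection, sql, params) helper and the field names (range_start, range_end, collection_name) are hypothetical, not a specific DocumentDB SDK call:

```python
def query(collection_name, sql, params):
    """Placeholder for your database client's query method."""
    raise NotImplementedError

def receipts_for_group(group_id):
    # 1. Consult the map collection to find which collection
    #    stores this group's receipts.
    mapping = query(
        "shard_map",
        "SELECT * FROM m WHERE m.range_start <= @gid AND @gid <= m.range_end",
        {"@gid": group_id},
    )
    target = mapping[0]["collection_name"]

    # 2. Query that collection for all receipts belonging to the group.
    return query(
        target,
        "SELECT * FROM r WHERE r.group_id = @gid",
        {"@gid": group_id},
    )
```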

Using this approach, though, and taking into account our estimates, each collection will be able to support only 2,000 groups.

2,000 groups * 2,500 receipts/year * 5 years * 10KB = 250GB

Assuming linear growth of our users over 5 years results in 2M users for the first year, which in the best case means 500K groups (4 users per family, for example) or 250 collections. The problem is that we will need to create a new collection for every 2,000 groups even though the previous ones are less than 20% full after the first year. That is a bit expensive, considering that we don't know how our user base and the use of our product will grow.

A preferred approach would be to create a new collection only when the previous one becomes full.

Sharding by timestamp

Because receipts are submitted over time, another feasible approach would be to shard the data by timestamp. We end up with a picture similar to the one above; however, instead of using group_id as the partition key, we use the timestamp: receipts with timestamps in a particular range are stored in a single partition.
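A minimal sketch of the routing logic, assuming monthly buckets (in practice the bucket size would be chosen so that each collection fills up close to the 250GB limit):

```python
from datetime import datetime

def collection_for_timestamp(ts):
    # One collection per month; receipts with timestamps in the same
    # month land in the same collection.
    return "receipts-%04d-%02d" % (ts.year, ts.month)

print(collection_for_timestamp(datetime(2016, 7, 14)))  # receipts-2016-07
```

A query for "the last month of receipts" then touches at most two collections.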

In this case, we would have problems pulling out all the receipts for a particular group, but considering that this is a very rare (though still possible) scenario, the trade-off may be warranted. Searching for a receipt by its properties would also be a challenge, because we would need to scan every collection. For everyday use, users will request the receipts from the last week or month, which results in a query to a single collection.

The good side of this approach is that we need to create a new collection only when the previous one is full, which means we will not be paying for unused space.

Multi-tier sharding

The previous two approaches assume that there is a single tier for sharding the data. Another approach would be to have two (or more) tiers. In our case, it would look something like this:

[Figure: multi-tier sharding]

Using this approach, we store the receipt summaries in the first shard tier, which allows us to index many more receipts with a smaller number of collections. We can search by group_id to identify the receipts we need and then pull the full receipt from the second tier if the user requests it. Running the numbers for the first-year user base over the full 5 years looks something like this:

2M users -> 500K groups -> 6.25B receipts (500K groups * 12,500 receipts) -> 250 full-receipt partitions + 7 intermediate (summary) partitions

However, a single intermediate collection can now support 80,000 groups (instead of 2,000 as in the previous case), and we fill both the summary and the full-receipt collections completely before a new one is created. The number of collections will also grow much more slowly if our user base grows fast.

The multi-tier sharding approach can also be done using the timestamps or the receipt identifiers as keys for the intermediate collection.
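Here is a minimal sketch of the two-tier lookup, again using the hypothetical query() helper; the summary fields (receipt_id and full_collection) are assumptions about how the tiers reference each other:

```python
def query(collection_name, sql, params):
    """Placeholder for your database client's query method."""
    raise NotImplementedError

def find_full_receipt(group_id, matches):
    # Tier 1: scan only the group's summaries (~250 bytes each),
    # which are sharded by group_id.
    summaries = query(
        "summaries-001",  # resolved from group_id via a map, as before
        "SELECT * FROM s WHERE s.group_id = @gid",
        {"@gid": group_id},
    )
    summary = next(s for s in summaries if matches(s))

    # Tier 2: fetch the full ~10KB receipt only on demand.
    return query(
        summary["full_collection"],
        "SELECT * FROM r WHERE r.id = @rid",
        {"@rid": summary["receipt_id"]},
    )[0]
```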

Sharding by receipt identifier

Sharding by receipt_id is obviously the simplest way to shard the data; however, it may not be feasible in a scenario like ours, because receipts are retrieved based on group_id, and we would end up querying every collection to retrieve all the receipts, or to find a particular receipt, belonging to a group. Well, that is the case only if the NoSQL provider does not offer automatic partitioning functionality, and because DocumentDB does, our problem turned out to be a non-problem 🙂 Nevertheless, you need to consider all the implications when choosing the partition key.
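Conceptually, automatic partitioning does something like the following toy hash routing (an illustration only, not DocumentDB's actual implementation):

```python
import hashlib

NUM_COLLECTIONS = 250

def collection_for_receipt(receipt_id):
    # Hash the receipt_id and map it onto one of the collections;
    # writes spread evenly, but a query by group_id must fan out
    # to all NUM_COLLECTIONS collections.
    digest = hashlib.md5(receipt_id.encode("utf-8")).hexdigest()
    return "receipts-%03d" % (int(digest, 16) % NUM_COLLECTIONS)

print(collection_for_receipt("r-12345"))
```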

As I mentioned above, we started with DocumentDB as our choice for storing the data, but after running the numbers we may reconsider. DocumentDB is a great choice for storing JSON data and offers amazing features for partitioning and querying it; however, looking at our projections, the cost of using it may turn out to be quite high.

You may be wondering why I chose Python as the language to teach you software engineering practices. There are tons of other languages one could use for that purpose, languages that are much sexier than Python. Well, I certainly have my reasons, and here is a summary:

  • First of all, Python is a very easy language to learn, which makes it a good choice for beginners
  • Python is an interpreted programming language, which means that you receive immediate feedback from the commands you type
  • Python supports both functional and object-oriented approaches to programming, which is good if you don't yet know which path you want to take
  • Python is a versatile language that can be used to develop all kinds of applications, hence it is used by people in various roles. Here are some:
    • Front-end developers can use it to implement dynamic functionality on websites
    • Back-end developers can use it to implement cloud-based services and APIs and to communicate with other services
    • IT people can use it to automate infrastructure, application deployment, and all kinds of other tasks
    • Data scientists can use it to create data models, parse data, or implement machine learning algorithms

As you can see, Python is a programming language that, if you become good at it, can enable multiple paths for your career. Learning the language, as well as establishing good development habits, will open many doors for you.

For the past twenty or so years, since I started my career in technology in 1996, almost every book I have read about programming, although providing detailed coverage of the particular programming language it was written about, lacked crucial information on how to become a good Software Engineer. Learning a programming language from such a book is like learning the syntax and grammar of a foreign language but never understanding the traditions of the native speakers, the idioms they use, or how to express yourself without offending them. Yes, you can speak the language, but you will have a lot of work to do before you start to fit in.

Learning the letters, the words, and how to construct a sentence is just a small part of learning a new language. This is also true for programming languages. Knowing the syntax, the data types, and the control structures will not make you a good software engineer. It is surprising to me that so many books and college classes concentrate only on those things while neglecting fundamental topics like how to design an application, how to write maintainable and performant code, and how to debug, troubleshoot, package, and distribute it. The lack of understanding in those areas not only makes new programmers inefficient but also establishes bad habits that are hard to change later on.

I’ve seen thousands and thousands of lines of undocumented code, whole applications that log no errors and nobody can figure out where they break, web pages that take 20 minutes to load, and plain silly code that calls a function to sum two numbers (something that can be achieved simply with a plus sign). Hence I decided to write a book that not only explains the Python language in a simple and understandable way but also teaches the fundamental practices of software engineering. A book that, after you read it, will have you ready to jump in and develop high-quality, performant, and maintainable code that meets the requirements of your customers. A book that any person can pick up and use to learn how to become a Software Engineer.

I intentionally use the term Software Engineer because I want to emphasize that developing high-quality software involves a lot more than just writing code. I wanted to write a book that prepares you to be a Software Engineer, and not simply a Coder or Programmer. I hope that with this book I achieved this goal and helped you, the reader, advance your career.