person-drawing-data-relations

Sharding the data in your big data storage is rarely a trivial problem to solve. Sooner or later you will discover that the sharding schema you chose initially is not the right one for the long term. Fortunately, we stumbled upon this quite early in our development and decided to redesign our data storage tier before filling it with lots of data, which would have made reshuffling it quite complex.

In our particular case, we started with Microsoft Azure’s DocumentDB, which limits its collection size to 250GB unless you explicitly ask Microsoft for more. DocumentDB provides automatic partitioning of the data but selecting the partition key can still be a challenge, hence you may find the exercise below useful.

The scenario we were planning for relates to our personal finance application, which allows users to save receipt information. Briefly, the flow is as follows: a user captures a receipt with her phone, we convert the receipt to JSON and store the JSON in DocumentDB. Users can be part of a group (for example, a family can be a group), and the receipts are associated with the group. Here is a simple relationship model:

zenxpense-data-model

The expectation is that users will upload receipts every day, and that they will use the application (we hope:)) for years to come. We made the following estimates for our data growth:

  • We would like to have a sharding schema that can support our needs for the next 5 years; we don’t know what our user growth will be over that period, but like every other startup we hope that it will be something north of 10 million users 🙂
  • A single user or family saves about 50 receipts a week, which will result in approximately 2,500 receipts a year or 12,500 over 5 years
  • A single receipt requires about 10KB of storage. We can also store a summary of the receipt, which will require about 250 bytes of storage (but we will still need to store the full receipt somewhere)

Additionally, we don’t need to store the user and group documents in separate collections (i.e. we could put all three document types in the same collection), but we decided to do so in order to allow easier access to that data for authentication, authorization, and visualization purposes. With all that said, we are left to deal mostly with the receipts data, which will grow at a much faster pace. Based on the numbers above, a single 250GB collection can store 25M full receipts (250GB / 10KB) or 1B summaries (250GB / 250 bytes). Thus we started looking at different approaches to shard the data.

Using the assumptions above you can easily come up with some staggering numbers. For example, for the first year we should plan for:

2M users * 2,500 receipts/each * 10KB per receipt = 50TB of storage

This may even call into question the choice of DocumentDB (or any other NoSQL database) as the data storage tier. Nevertheless, the point of this post is how to shard the data and what process we went through to do that.

In the explanations below I will use DocumentDB terminology, but you can easily translate the concepts to any other database or storage technology.
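The estimates above are easy to reproduce. Here is a minimal sketch of the arithmetic in Python; the constant names are mine, and I am assuming decimal units (1GB = 10^9 bytes) because that is what matches the rounded numbers in this post.

```python
# Back-of-the-envelope capacity estimates using the figures from this post.
# Decimal units (1 GB = 10**9 bytes) are an assumption that matches the
# rounded numbers in the text.
KB = 10**3
GB = 10**9
TB = 10**12

COLLECTION_LIMIT = 250 * GB   # default DocumentDB collection size
RECEIPT_SIZE = 10 * KB        # full receipt
SUMMARY_SIZE = 250            # receipt summary, in bytes
RECEIPTS_PER_YEAR = 2_500     # about 50 receipts a week

# How many documents of each kind fit into one collection.
receipts_per_collection = COLLECTION_LIMIT // RECEIPT_SIZE
summaries_per_collection = COLLECTION_LIMIT // SUMMARY_SIZE

# First-year storage if every one of 2M users saves 2,500 full receipts.
first_year_users = 2_000_000
first_year_bytes = first_year_users * RECEIPTS_PER_YEAR * RECEIPT_SIZE

print(receipts_per_collection)    # 25000000
print(summaries_per_collection)   # 1000000000
print(first_year_bytes / TB)      # 50.0
```

Changing any of the constants (for example the receipt size) shows right away how sensitive the collection count is to the assumptions.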

Sharding by tenant identifier

One obvious approach for sharding the data is to use the tenant identifier (user_id or group_id in our case) as a key. Thus we will have a single collection where we store the mapping information and multiple collections that will store the receipts for a range of groups. As shown in the picture below, based on group_id we will be able to retrieve the name of the collection where the receipts for this group are stored using the map collection, and then query the resulting collection to retrieve any receipt that belongs to the group.

sharding-by-group-id
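To make the two-step lookup concrete, here is a minimal in-memory sketch. It does not use the DocumentDB SDK; the `ShardMap` class, the `receipts-0000` naming scheme and the dictionaries standing in for collections are all hypothetical and only illustrate the idea.

```python
from typing import Dict, List, Optional


class ShardMap:
    """Stand-in for the map collection: group_id -> receipts collection name."""

    def __init__(self, groups_per_collection: int = 2_000) -> None:
        self.groups_per_collection = groups_per_collection
        self._collection_by_group: Dict[str, str] = {}

    def assign(self, group_id: str) -> str:
        """Return the group's collection, assigning one on first use."""
        if group_id not in self._collection_by_group:
            # Every `groups_per_collection` new groups start a new collection.
            index = len(self._collection_by_group) // self.groups_per_collection
            self._collection_by_group[group_id] = f"receipts-{index:04d}"
        return self._collection_by_group[group_id]

    def lookup(self, group_id: str) -> Optional[str]:
        """Return the group's collection, or None for an unknown group."""
        return self._collection_by_group.get(group_id)


# In-memory stand-in for the receipts collections themselves.
shards: Dict[str, List[dict]] = {}


def save_receipt(shard_map: ShardMap, receipt: dict) -> None:
    name = shard_map.assign(receipt["group_id"])
    shards.setdefault(name, []).append(receipt)


def receipts_for_group(shard_map: ShardMap, group_id: str) -> List[dict]:
    # Step 1: map lookup. Step 2: query only the one collection it names.
    name = shard_map.lookup(group_id)
    if name is None:
        return []
    return [r for r in shards[name] if r["group_id"] == group_id]
```

Retrieving the receipts of a group touches exactly one receipts collection, which is the main attraction of this approach.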

Using this approach, though, and taking into account our estimates, each collection will be able to support only 2,000 groups.

2,000 groups * 2,500 receipts/year * 5 years * 10KB = 250GB

Assuming linear growth of our user base over 5 years, we get 2M users in the first year, which in the best case means 500K groups (4 users per family, for example) or 250 collections. The problem is that we will need to create a new collection for every 2,000 groups even though the previous one is less than 20% full. That is a bit expensive, bearing in mind that we don’t know how our user base and the use of our product will grow.

A preferred approach would be to create a new collection only when the previous one becomes full.

Sharding by timestamp

Because the receipts are submitted over time, another feasible approach would be to shard the data by timestamp. We would end up with a picture similar to the one above; however, instead of group_id, the partition key would be the timestamp, and receipts with timestamps in a particular range would be stored in a single partition.

In this case, we would have problems pulling out all the receipts for a particular group, but considering that this is a very rare (though still possible) scenario, the trade-off may be warranted. Searching for a receipt by its properties would also be a challenge, though, because we would need to scan every collection. In everyday use, users will request the receipts from the last week or month, which will result in a query to a single collection.

The good side of this approach is that we will need to create a new collection only when the previous one is full, which means we will not be paying for unused space.
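Here is a minimal sketch of how routing by timestamp could work. The `TimeShardMap` class and the collection names are hypothetical; I keep a sorted list of start times, one per collection, and use binary search to find the collection that covers a given timestamp.

```python
import bisect
from datetime import datetime
from typing import List


class TimeShardMap:
    """Maps a receipt timestamp to the collection covering that time range."""

    def __init__(self) -> None:
        self._starts: List[datetime] = []  # sorted start time of each collection
        self._names: List[str] = []

    def open_collection(self, start: datetime) -> str:
        """Open a new collection for timestamps >= start.

        Meant to be called only when the previous collection is full.
        """
        if self._starts and start <= self._starts[-1]:
            raise ValueError("collections must be opened in chronological order")
        name = f"receipts-{len(self._names):04d}"
        self._starts.append(start)
        self._names.append(name)
        return name

    def collection_for(self, timestamp: datetime) -> str:
        """Return the collection whose range contains the timestamp."""
        index = bisect.bisect_right(self._starts, timestamp) - 1
        if index < 0:
            raise KeyError("timestamp is older than the first collection")
        return self._names[index]

    def collections_between(self, start: datetime, end: datetime) -> List[str]:
        """Return every collection that a [start, end] query has to touch."""
        first = max(bisect.bisect_right(self._starts, start) - 1, 0)
        last = bisect.bisect_right(self._starts, end) - 1
        return self._names[first:last + 1]
```

A query for the last week or month usually stays inside one collection, and touches two only when the range happens to span the moment a new collection was opened.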

Multi-tier sharding

The previous two approaches assume that there is a single tier for sharding the data. Another approach would be to have two (or more) tiers for sharding the data. In our case, this would look something like this:

multi-tier-sharding

Using this approach we will store the receipt summaries in the first shard tier, which will allow us to save more receipts in a smaller number of collections. We will be able to search by group_id to identify the receipts we need and then pull the full receipt if the user requests it. If we run the numbers for the groups we expect in the first year, sized for five years of receipts per group, it will look something like this:

2M users -> 500K groups -> 6.25B receipts -> 250 partitions + 7 intermediate partitions

However, we can support 80,000 groups with a single intermediate collection (instead of 2,000 as in the previous case), and we will fill in both the summary and the full-receipts collections before a new one is created. Also, the number of collections will grow much more slowly if our user base grows fast.
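The multi-tier numbers can be checked with a few lines of Python. This is only a sketch of the arithmetic; the constant names are mine, and I am again assuming decimal units and 12,500 receipts per group over five years.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)


GB = 10**9
COLLECTION_LIMIT = 250 * GB
RECEIPT_SIZE = 10_000             # 10KB per full receipt
SUMMARY_SIZE = 250                # bytes per receipt summary
RECEIPTS_PER_GROUP = 2_500 * 5    # five years of receipts per group

groups = 2_000_000 // 4           # 2M users, 4 users per group
receipts = groups * RECEIPTS_PER_GROUP

full_collections = ceil_div(receipts * RECEIPT_SIZE, COLLECTION_LIMIT)
summary_collections = ceil_div(receipts * SUMMARY_SIZE, COLLECTION_LIMIT)
groups_per_summary_collection = COLLECTION_LIMIT // (RECEIPTS_PER_GROUP * SUMMARY_SIZE)

print(receipts)                        # 6250000000
print(full_collections)                # 250
print(summary_collections)             # 7
print(groups_per_summary_collection)   # 80000
```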

The multi-tier sharding approach can also be done using the timestamps or the receipt identifiers as keys for the intermediate collection.
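As an illustration of the two tiers, here is a small in-memory sketch. The `Summary` fields (`store`, `total`) and the collection name passed to `save_receipt` are hypothetical; the important part is that each summary carries a pointer to the collection that holds the full receipt.

```python
from typing import Dict, List, NamedTuple


class Summary(NamedTuple):
    receipt_id: str
    group_id: str
    store: str             # hypothetical summary field
    total: float           # hypothetical summary field
    full_collection: str   # second-tier collection holding the full receipt


# Tier 1: small summaries, searchable by group_id.
summaries: Dict[str, List[Summary]] = {}
# Tier 2: full receipts, keyed by collection name and then receipt_id.
full_receipts: Dict[str, Dict[str, dict]] = {}


def save_receipt(receipt: dict, full_collection: str) -> None:
    """Store the full receipt in tier 2 and its summary in tier 1."""
    full_receipts.setdefault(full_collection, {})[receipt["receipt_id"]] = receipt
    summary = Summary(
        receipt_id=receipt["receipt_id"],
        group_id=receipt["group_id"],
        store=receipt["store"],
        total=receipt["total"],
        full_collection=full_collection,
    )
    summaries.setdefault(receipt["group_id"], []).append(summary)


def list_summaries(group_id: str) -> List[Summary]:
    """Everyday query: only the first tier is touched."""
    return summaries.get(group_id, [])


def load_full(summary: Summary) -> dict:
    """Pull the full receipt only when the user asks for it."""
    return full_receipts[summary.full_collection][summary.receipt_id]
```

Listing the receipts of a group reads only the 250-byte summaries; the 10KB document is fetched from the second tier on demand.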

Sharding by receipt identifier

Sharding by receipt_id is obviously the simplest way to shard the data. However, it may not be feasible in a scenario like ours because the receipts are retrieved based on the group_id, so we would have to query every collection to retrieve all the receipts, or to find a particular receipt, belonging to a group. That is the case only if the NoSQL provider does not offer automatic partitioning; because DocumentDB does, our problem turned out to be no problem 🙂 Nevertheless, you need to consider all the implications when choosing the partition key.

As I mentioned above, we started with DocumentDB as our choice for storing the data, but after running the numbers we may reconsider that choice. DocumentDB is a great choice for storing JSON data and offers amazing features for partitioning and querying it; however, looking at our projections, the cost of using it may turn out to be quite high.
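To show why manual sharding by receipt_id hurts group queries, here is a small sketch. It is not how DocumentDB partitions data internally; the hash-based routing, the `NUM_COLLECTIONS` value and the collection names are assumptions I made for illustration.

```python
import hashlib
from typing import Dict, List, Optional

NUM_COLLECTIONS = 4  # hypothetical; a real number would come from the estimates

# In-memory stand-in for the receipts collections.
shards: Dict[str, List[dict]] = {}


def collection_for_receipt(receipt_id: str, num_collections: int = NUM_COLLECTIONS) -> str:
    """Pick a collection from a stable hash of the receipt id."""
    digest = hashlib.sha256(receipt_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % num_collections
    return f"receipts-{index:04d}"


def save_receipt(receipt: dict) -> None:
    shards.setdefault(collection_for_receipt(receipt["receipt_id"]), []).append(receipt)


def find_receipt(receipt_id: str) -> Optional[dict]:
    """Lookup by id touches exactly one collection."""
    for receipt in shards.get(collection_for_receipt(receipt_id), []):
        if receipt["receipt_id"] == receipt_id:
            return receipt
    return None


def receipts_for_group(group_id: str) -> List[dict]:
    """Lookup by group has to fan out to every collection."""
    return [r for shard in shards.values() for r in shard if r["group_id"] == group_id]
```

The lookup by id is cheap, but `receipts_for_group` has to visit every collection, which is exactly the query our application runs most often.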

python-code

You may be wondering why I chose Python as the language to teach you software engineering practices. There are tons of other languages one can use for that purpose, languages that are much sexier than Python. Well, I certainly have my reasons, and here is a summary:

  • First of all, Python is a very easy language to learn, which makes it a good choice for beginners
  • Python is an interpreted programming language, which means that you receive immediate feedback from the commands you type
  • Python supports both functional and object-oriented approaches to programming, which is good if you don’t know which path you want to choose
  • Python is a versatile language that can be used to develop all kinds of applications, hence it is used by people in various roles. Here are some:
    • Front-end developers can use it to implement dynamic functionality on websites
    • Back-end developers can use it to implement cloud-based services, APIs and communicate with other services
    • IT people can use it to develop infrastructure, application deployment and all kinds of other automation
    • Data scientists can use it to create data models, parse data or implement machine learning algorithms

As you can see Python is a programming language that, if you become good at it, can enable multiple paths for your career. Learning the language as well as establishing good development habits will open many doors for you.

programming-book

For the past twenty or so years, since I started my career in technology in 1996, almost every programming book I read provided detailed coverage of its particular programming language but lacked the crucial information that teaches the reader how to become a good Software Engineer. Learning a programming language from such a book is like learning the syntax and the grammar of a foreign language but never understanding the traditions of the native speakers, the idioms they use, or how to express yourself without offending them. Yes, you can speak the language, but you will have a lot of work to do before you start to fit in.

Learning the letters, the words and how to construct a sentence is just a small part of learning a new language. This is also true for programming languages. Knowing the syntax, the data types, and the control structures will not make you a good software engineer. It is surprising to me that so many books and college classes concentrate only on those things while neglecting fundamental topics like how to design an application, how to write maintainable and performant code, how to debug, troubleshoot, package or distribute it. The lack of understanding in those areas makes new programmers not only inefficient but also establishes bad habits that are hard to change later on.

I’ve seen thousands and thousands of lines of undocumented code, whole applications that log no errors so that nobody can figure out where they break, web pages that take 20 minutes to load, and plain silly code that calls a function to sum two numbers (something that can be achieved simply with a plus sign). Hence I decided to write a book that not only explains the Python language in a simple and understandable way but also teaches the fundamental practices of software engineering. A book that, once you have read it, will have you ready to jump in and develop high-quality, performant and maintainable code that meets the requirements of your customers. A book that anyone can pick up to learn how to become a Software Engineer.

I intentionally use the term Software Engineer because I want to emphasize that developing high-quality software involves a lot more than just writing code. I wanted to write a book that will prepare you to be a Software Engineer, and not simply a Coder or Programmer. I hope that with this book I achieved this goal and helped you, the reader, to advance your career.

komodo-edit-10-ide

With our first full-time developer on board, I had to put some structure around the tools and services we will use to manage our work. In general I don’t want to be too prescriptive about what tools the team should use to get the job done, but it is good to set some guidelines for the tool set and outline the mandatory and optional ones. For our development we’ve made the following choices:

  • Microsoft Azure as Cloud Provider
  • TornadoWeb and Python 2.7 as a runtime for our APIs and frontend
  • DocumentDB and Azure storage for our storage tier
  • Azure Machine Learning and Microsoft Cognitive Services for machine learning

Well, those are the mandatory things but as I mentioned in my previous post How to Build a Great Software Development Team?, software development is more than just technology. Nevertheless we had to decide on a toolset to at least start with, so here is the list:

1. Slack

My first impression of Slack was lukewarm, and I preferred the more conservative UI of HipChat. However, compared to HipChat, Slack offered multiple-team capability right from the beginning, which allowed me to communicate not only with my team but also with a client site and with the advisory team for North Seattle College. In addition, HipChat introduced quite a few bugs in their latest versions, which made the team communication quite unreliable and unproductive, and this totally swayed the decision towards Slack. After some time I got used to Slack’s UI and started liking it, and now it is an integral part of our team’s communication.

2. Outlook 2016

For my personal email I use Google Apps with a custom domain; however, I’ve been a long-time Outlook user, and with the introduction of Office 365 I think Microsoft offers the better value for the money. Managing multiple email accounts and calendars, and scheduling in-person or online meetings using the GoToMeeting and Skype for Business plugins, is a snap with Outlook. With the added benefit of using Word, Excel and PowerPoint as part of the subscription, Office 365 is a no-brainer. We use Office 365 E3, which gives each one of us the full set of Office capabilities.

3. Dropbox

Sending files via email is an archaic approach, although I see it still being widely done. Instead, we have set up Dropbox for the team. I have created shared folders for the leadership team as well as for each one of the team members, allowing them to easily share files with each other. We settled on Dropbox Pro for the leadership team and the free Dropbox for the team members. In the future we are considering moving to the Business Edition.

4. Komodo Edit

I have been a long-time fan of Komodo. It is a very lightweight IDE that offers highlighting and type-assist for a number of programming languages like Python, HTML5, JavaScript and CSS3. It also allows you to extend the functionality with third-party plugins offering rich capabilities. I use it for most of my development.

5. Visual Studio Code

Visual Studio Code is the new cross-platform IDE from Microsoft. It is a lightweight IDE similar to Sublime Text, and offers a lot of nice features that can be very helpful if you develop for Azure. It has built-in debugging and IntelliSense, and a plugin extensibility model with a growing number of plugin vendors. It is a great tool for creating Markdown documents, debugging with breakpoints from within the IDE and more. Visual Studio Code is an alternative to Visual Studio that allows you to develop for Azure on platforms other than Windows. If you are a Visual Studio fan but don’t want to pay a hefty amount of money, you can give Visual Studio Community Edition a try (unfortunately available for Windows only). Here is a Visual Studio Editions comparison chart that you may find useful.

6. Visual Studio Online

Managing the development project is crucial for the success of your team. The functionality that Visual Studio Online offers for keeping backlogs, tracking sprint work items and reporting is comparable to, if not better than, Jira, and if you are bound to the Microsoft ecosystem it is the obvious choice. Our small team relies almost entirely on the free edition, and it gives us all the functionality we need to manage the work.

7. Docker

Being able to deploy a complete development environment with the click of a button is crucial for development productivity. Creating a Docker Compose template consisting of two TornadoWeb workers with an NGINX load balancer in front (very similar to the configuration we plan to use in Production) takes less than an hour with Docker, and reduces the operational overhead for developers many times over. Not only that, but it also completely mimics the production configuration, which means the probability of introducing bugs caused by environment differences is practically zero.

With the introduction of Docker for Windows all the above became much easier to do on Windows Desktop, which is an added benefit.

8. Draw.IO

Last but not least being able to visually communicate your application or system design is essential for successful development projects. For that purpose we use Draw.IO. In addition to the standard block diagrams and flowcharts it offers Azure and AWS specific diagrams, creation of UI mockups, and even UML if you want to go so far.

Armed with the above set of tools you are well prepared to move fast with your development project on a lean budget.

python-expense-sample-github

For a while I have been looking for a good sample application in Python that I can use for training purposes. The majority of the sample applications available online cover a certain topic like data structures or string manipulation, but so far I have not found one that takes a more holistic approach. For Basic Python Developer Training I would like to use a real-life application that covers various areas of the language syntax and structures, but can also teach good software development practices. There are minimum requirements for a Software Developer that I believe need to be taught in Basic Development Classes, and the projects used in such classes need to make sure that those minimum requirements are met.

For our new developers’ training I decided to use a simple Expense Reports application with very basic requirements:

  • I should be able to store receipt information in a file
  • The following information about the receipt should be stored
    • Date
    • Store
    • Amount
    • Tags
  • I should be able to generate a report for my expenses based on the following information
    • Date range
    • Store
    • Tags
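To give a feel for the scope, here is a minimal sketch of those requirements in Python. It is not the code from the class; the function names and the choice of a JSON-lines file are my assumptions for illustration, and a real application would likely use `decimal.Decimal` rather than floats for money.

```python
import json
from datetime import date
from typing import Iterable, List, Optional


def save_receipt(path: str, when: date, store: str, amount: float,
                 tags: Iterable[str]) -> None:
    """Append one receipt to the file as a single JSON line."""
    record = {
        "date": when.isoformat(),
        "store": store,
        "amount": amount,
        "tags": sorted(set(tags)),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_receipts(path: str) -> List[dict]:
    """Read all receipts; a missing file simply means no receipts yet."""
    try:
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
    except FileNotFoundError:
        return []


def report(receipts: Iterable[dict], start: Optional[date] = None,
           end: Optional[date] = None, store: Optional[str] = None,
           tags: Optional[Iterable[str]] = None) -> List[dict]:
    """Filter receipts by date range, store and tags (any tag matches)."""
    wanted = set(tags or [])
    selected = []
    for receipt in receipts:
        when = date.fromisoformat(receipt["date"])
        if start and when < start:
            continue
        if end and when > end:
            continue
        if store and receipt["store"].lower() != store.lower():
            continue
        if wanted and not wanted & set(receipt["tags"]):
            continue
        selected.append(receipt)
    return selected


def total(receipts: Iterable[dict]) -> float:
    """Sum the amounts of the selected receipts."""
    return sum(r["amount"] for r in receipts)
```

For example, `total(report(load_receipts("receipts.jsonl"), tags=["groceries"]))` would give the amount spent on groceries, assuming receipts were saved to that (hypothetical) file.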

My goal with this application is to teach junior developers a few things:

  • Python Language Concepts like data types, control structures etc. as well as a bit more complex concepts like data structures, data manipulation, data conversion, file input and output and so on
  • Code Maintainability Practices like naming conventions, comments and code documentation, modularity etc.
  • Basic Application Design including requirements analysis and clarification
  • Basic User Experience concepts like UI designs, user prompts, input feedback, output formatting etc.
  • Application Reliability including error and exception handling
  • Testing that includes unit and functional testing
  • Troubleshooting that includes debugging and logging
  • Interoperability for interactions with other applications
  • Delivery including packaging and distribution

I have started a Python Expenses Sample Github Project for the application, where I will check in the code from the weekly classes as well as instructions on how to use it.

 

software-development-mindmap

We are looking to hire a few interns for the summer, and this made me think about what approach I should take to provide a great experience for them as well as get some value out of it for us. The culture that I am trying to establish in my company is based on the premise that software development is more than writing code. I know that this is an overused cliche, but if you look around there are thousands of educational institutions and private companies that concentrate on teaching just that: how to write code, while neglecting everything else that is involved in software development.

For that purpose I decided to create a crash course in good software development practices and run our new hires through it. Having been involved in quite a few technology projects over the last 20+ years, and having seen a lot of failures and successes, I have developed my own approach that may or may not fit the latest trends but has served me well. Also, having managed one of the best performing teams in the Microsoft Windows Division during Windows 7 (yes, I can claim that :)) I can say that I also have some experience with building great teams.

So, my goal is for our interns to be real software developers by the end of the summer, and for that experience they will get paid instead of spending money. Now, here are the things that I want them to know at the end of the summer:

Team

The team is the most important part of software development. The few important things that I want to teach them are that they need to work together, help each other, solve problems together, and NOT throw things over the fence because this is not their area of responsibility. If they learn this, I will have accomplished my job as their mentor (I am kidding 🙂 but yes, I think there are too many broken teams around the world).

As software developers they are responsible for the product and the customer experience, no matter whether they write the SQL scripts, the APIs or the client UI. If there is a problem with the code, they need to work with their peers to troubleshoot and solve the problem. If one of their peers has difficulty implementing something and they know the answer, they should help that peer move to the next level, and not keep it to themselves because they are scared that she or he will take their job.

And one more thing – politics are strictly forbidden! 

Communication

Communication is key. The first thing I standardize in each one of my projects is the communication channels for the team. And this is not as simple as using Slack for everything; it also covers regular meetings, who manages the meetings, where documents are saved, what the naming conventions for folders and files are, when to use which tool (Slack, email, others) and so on.

Being able to communicate effectively does not mean strictly defining organizational hierarchies; it means keeping everyone informed and being transparent.

Development Process

As a friend of mine once said: “Try to build a car with agile!” We always jump to the latest and greatest but often forget the good parts from the past. Yes, I will teach them agile; whether it is Scrum or Kanban doesn’t really matter. What is important is that they feel comfortable with their tasks and are able to deliver. And don’t forget: there will be design tasks for everything. This brings me to:

Software Design

Software Design is an integral part of software development. There are a few parts of the design that I put emphasis on:

  • User Interface (UI) design
    They need to be able to understand what purpose the wire-frames play; what is redlining; when do we need one or the other; what are good UI design patterns and how to find those and so on
  • Systems design
    They need to understand how each individual system interacts with the rest, how each system is configured and how their implementation is impacted
  • Software components design
    Every piece of software has dependencies, and they need to learn how to map those dependencies, how different components interact with each other, and where things can break. Things like libraries, packaging, code repository structure etc. all play a role here

Testing

The best way to learn good development practices is to test your own software. Yes, this is correct: eat your own dogfood! And I am not talking about testing just the piece of code you wrote (so-called unit testing) but the whole experience you worked on as well as the whole product.

By learning how to test their code better, my developers will not only see the results of their work but will also be more cognizant next time of the experience they are developing and how they can improve it.

Build and Deployment

Manually building infrastructure and deploying code is painful and a waste of time. I want my software developers to think about automated deployment from the beginning. As a small company we cannot afford to have a dedicated Operations team whose only job is to copy bits from one place to another and build environments.

Tools like Puppet, Chef, Ansible or Salt are not new to the world, but having people manually create virtual machines still seems to be very common. Learning how to automate their work will allow them to deliver more in less time and become better performers.

Operations

Operating the application is the final step in the software development. Being able to easily troubleshoot issues and fix bugs is crucial for the success of every development team. Incorporating things like logging, performance metrics, analytical data and so on from the beginning is something I want my developers to learn at the start.
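As a small illustration of what I mean by building this in from the beginning, here is a minimal sketch using only Python's standard `logging` module. The `instrumented` decorator, the logger name and the `parse_receipt` example function are hypothetical names I made up for this post.

```python
import functools
import logging
import time

logger = logging.getLogger("app")  # hypothetical logger name


def instrumented(func):
    """Log failures and the duration of every call to the wrapped function."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            # Records the message together with the full stack trace.
            logger.exception("%s failed", func.__name__)
            raise
        finally:
            elapsed_ms = (time.perf_counter() - started) * 1000
            logger.info("%s took %.1f ms", func.__name__, elapsed_ms)

    return wrapper


@instrumented
def parse_receipt(text: str) -> dict:
    """Hypothetical example: parse 'store,amount' into a receipt dict."""
    store, amount = text.split(",")
    return {"store": store.strip(), "amount": float(amount)}


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    parse_receipt("Safeway, 23.50")
```

With a decorator like this, every call leaves a timing record in the log, and every failure leaves a stack trace, which already answers the "where does it break" question for most bugs.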

One of the areas I particularly put emphasis on is the Business Insights (BI) part of software development. Incorporating code that collects usage information will not only help them better understand the features they implemented but most importantly will prevent them from writing things nobody uses.

The list above is a very rough plan for the crash course I am preparing for our interns. As it progresses I will post more details on how it goes, what they learned, what tools we use and so on. I started sketching things on the mindmap above and it is growing pretty fast.

It will be an interesting experience 🙂