The big picture of how Khan Academy development works

Posted on February 6, 2011

If you haven’t heard of Khan Academy yet, you need to start reading more news. I first heard of Khan Academy when they were announced as a winner of Google’s Project 10^100 and have been telling people and tweeting about them ever since. I didn’t start looking into how their development process works until last night though.

Khan Academy is a nonprofit started by Salman Khan with the mission of educating the world. Sal himself has created over 2,000 videos on a range of topics from history to mathematics and everything in between. The videos are nothing short of amazing and are broken down into 10-minute chunks, originally because of the YouTube length limit imposed on Sal.

Shantanu is the President and COO of Khan Academy and, like Sal, has a strong mathematical background.

Khan Academy has a reputation/energy and badge system in place, which makes the site just as addictive as StackOverflow. The badge system is especially cool: it awards badges in real time, something not easily done with a NoSQL implementation and a huge dataset behind the scenes.

Khan Academy is hosted on Google Code and uses Kiln Hg (Mercurial) for source control, having upgraded from Subversion (SVN). There are currently 11 committers to the project; the most active by far (with over half a dozen commits even on a Saturday afternoon) is Ben Kamens (@kamens).

Ben is a former employee of Joel Spolsky’s company Fog Creek Software, and has a great blog with some interesting insight on how Khan Academy works. He accomplished a lot within just a few months of working at Khan Academy. He also develops a couple of cool iPhone apps, one called RulerPhone and the other Precorder.

Khan Academy runs on Google App Engine (GAE), which means they must use either Java or Python 2.5 (the Python runtime is sandboxed and cannot load C extension modules). Khan Academy uses Python 2.5 along with GAE’s default webapp framework. Since webapp does not include a template engine, they use the Django 0.96 template engine, which the GAE runtime includes by default. As with all Python GAE applications, the top-level URL mapping is configured through URL pattern matching in a YAML configuration file (app.yaml). GAE has a great getting started guide if you are interested. I was.
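To make the URL pattern matching idea concrete, here is a minimal sketch in plain Python rather than the real GAE or webapp API. It assumes an ordered list of regular expressions where the first match wins. The routes and handler names are made up for illustration and are not Khan Academy's actual URL map.

```python
import re

# Hypothetical routes, checked top to bottom; the first full match wins.
ROUTES = [
    (r"/", "home"),
    (r"/video/(?P<slug>[\w-]+)", "video"),
    (r"/exercises", "exercises"),
    (r"/.*", "not_found"),  # catch-all, so it must come last
]


def resolve(path):
    """Return (handler_name, captured_groups) for the first matching pattern."""
    for pattern, handler in ROUTES:
        match = re.fullmatch(pattern, path)
        if match:
            return handler, match.groupdict()
    return None, {}


print(resolve("/video/intro-to-algebra"))
```

Order matters in a scheme like this: moving the catch-all pattern to the top would shadow every other route.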

GAE works off of a datastore which is automatically replicated and scaled, and which is based on BigTable and hence the Google File System (GFS). GAE does not let you host a relational database, and your application has no write access to the filesystem. Instead of SQL, you query the datastore with the Google Query Language (GQL). GQL looks a lot like SQL, but you can’t do joins and you can’t select partial entities: a query returns either just the keys or the entire entities.
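Since there are no joins, related data has to be stitched together in application code. The sketch below uses plain dictionaries as a toy stand-in for the datastore (it is not the GAE API), and the kinds, keys, and fields are all hypothetical. The `query` helper mirrors the restriction described above by returning either keys or whole entities, never a subset of properties.

```python
# Toy in-memory "kinds": key -> whole entity. All names and data are made up.
PLAYLISTS = {"p1": {"title": "Algebra"}, "p2": {"title": "History"}}
VIDEOS = {
    "v1": {"title": "Solving equations", "playlist_key": "p1"},
    "v2": {"title": "Inequalities", "playlist_key": "p1"},
    "v3": {"title": "French Revolution", "playlist_key": "p2"},
}


def query(kind, keys_only=False, **filters):
    """Return matching keys, or matching whole entities, ordered by key."""
    hits = [
        (key, entity)
        for key, entity in sorted(kind.items())
        if all(entity.get(field) == value for field, value in filters.items())
    ]
    return [key for key, _ in hits] if keys_only else [entity for _, entity in hits]


def videos_with_playlist_title(playlist_title):
    """Emulate a join: one keys-only query, then one query per matching key."""
    rows = []
    for playlist_key in query(PLAYLISTS, keys_only=True, title=playlist_title):
        for video in query(VIDEOS, playlist_key=playlist_key):
            rows.append((playlist_title, video["title"]))
    return rows


print(videos_with_playlist_title("Algebra"))
```

The cost of this approach is one extra query per matching parent, which is one reason caching matters so much on this platform.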

GAE applications such as Khan Academy make use of caching so that the datastore does not need to be contacted on each page load. This caching is typically handled with the memcache service included in the GAE API. Typically, each model is saved to memcache whenever it is written to the datastore, and reads try memcache first before falling back to the datastore.
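The caching pattern described above can be sketched roughly as follows. This is not the GAE memcache API: two dictionaries stand in for memcache and the datastore, and the class and method names are my own invention for illustration.

```python
class CachedStore:
    """Write to both layers on put; read from the cache first on get."""

    def __init__(self):
        self.cache = {}           # stand-in for memcache
        self.datastore = {}       # stand-in for the datastore
        self.datastore_reads = 0  # counts how often the cache was missed

    def put(self, key, entity):
        self.datastore[key] = entity
        self.cache[key] = entity  # keep the cache in step with the write

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        self.datastore_reads += 1
        entity = self.datastore.get(key)
        if entity is not None:
            self.cache[key] = entity  # repopulate for the next reader
        return entity

    def evict(self, key):
        """Simulate the cache dropping an entry, which can happen at any time."""
        self.cache.pop(key, None)


store = CachedStore()
store.put("video:1", {"title": "Intro to algebra"})
store.get("video:1")          # served from the cache
store.evict("video:1")
store.get("video:1")          # falls back to the datastore once
print(store.datastore_reads)
```

Because a real cache can evict entries whenever it likes, the fallback path in `get` is not optional; the cache is only ever an optimization over the datastore.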

Khan Academy does expose an HTTP JSON API, but only for getting a list of playlists and the videos in each playlist. It would be great to see additional APIs for read-only access to the energy and badge system.

The backup system used by Khan Academy takes around 3 days to complete and is run on an Amazon EC2 instance. I think this could be improved by making the backups incremental or differential, and by using deduplication.
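To show what I mean, here is a rough sketch of an incremental, deduplicating backup using content hashes. It is purely my own illustration and says nothing about how Khan Academy's backup actually works. The `backup` function, the manifest format, and the record names are all hypothetical.

```python
import hashlib


def backup(records, chunk_store, previous_manifest=None):
    """Back up a dict of name -> bytes.

    chunk_store maps content digest -> bytes and is shared across runs.
    Returns (manifest, changed), where changed lists the names that are
    new or modified compared with previous_manifest.
    """
    previous_manifest = previous_manifest or {}
    manifest, changed = {}, []
    for name, payload in records.items():
        digest = hashlib.sha256(payload).hexdigest()
        manifest[name] = digest
        if previous_manifest.get(name) != digest:
            changed.append(name)  # incremental: only these need processing
        chunk_store.setdefault(digest, payload)  # dedup: identical content stored once
    return manifest, changed


store = {}
first, changed_first = backup({"a": b"x", "b": b"x", "c": b"y"}, store)
second, changed_second = backup({"a": b"x", "b": b"z", "c": b"y"}, store, first)
print(changed_second, len(store))
```

In this toy run, the second backup only has to handle the one record that changed, and records with identical content share a single stored copy.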

Khan Academy tries to fix all bugs before adding new features, which is a great mantra to have. Other than GAE, they use a few very cool JavaScript libraries and tools under the hood:

  • jQuery (who doesn’t use jQuery?)
  • ASCIIMathML to format math equations; it automatically converts any math expression written between backtick characters.
  • ASCIIsvg for graphing, which is done using an iframe containing generated SVG code (hurray for IE9 finally getting native SVG support).
  • JavaScript InfoVis: Provides tools for creating Interactive Data Visualizations for the Web. Used for the old knowledge map.
  • YUICompressor to compress the JavaScript, though better compression ratios could probably be achieved with the Google Closure Compiler.
  • Google Maps API v3 is used for the exercise dashboard using a custom map type and some other customizations on the controls and zoom. Another cool aspect is that you are actually zooming around images from the Hubble telescope.
  • Google Analytics to track visitor statistics.
  • Highcharts JS: Interactive JavaScript charts. They use this for user profile charts.
  • Raphaël JavaScript Library: Used for the scratchpad when doing exercises, and for exercise drawings. Raphaël is a JavaScript library for creating SVG graphics; every graphic object is a DOM object which can be manipulated.
  • MathJax: Math visualization library for inputs of MathML and LaTeX

Khan Academy uses HTML5, as shown by their HTML doctype declaration; however, in the exercise modules some simple changes could improve the user interface while staying compatible across browsers and platforms. Simply declaring input boxes as <input type="number"> would make most popular mobile phones display a numeric keypad by default. Older browsers that don’t understand HTML5 fall back to type="text" when they don’t recognize the specified type, so nothing breaks for them.

Sal himself started the code, but I would imagine most of his time today is spent creating the actual content videos, handling press, and doing thousands of other things. Dean Brettle and Omar Rizwan are also notable developers (sorry if I missed others). Dean, amongst other things, handles release management and created the scratchpad used in exercises. Omar has contributed at least 16 exercise modules. Jason Rosoff (Jason’s blog, @jasonrr) is also extremely involved in the project; he is known as the lead designer and also does some coding. Marcia Lee (@marcia_lee) is a recent hire and makes frequent commits.

If you are interested in helping with the Khan Academy project you can get started by:
