Lately I've become interested in graph databases. And over the past week I've been looking at graph database sharding.
tl;dr It is a difficult problem, one that appears to be more complex than RDBMS sharding. A few nice links on sharding relational databases:
- An Unorthodox Approach to Database Design : The Coming of the Shard
- Why you don’t want to shard. – The comments are more interesting than the article
- Shard Lessons
- Sharding for Startups - Eric Reis proposes an URL based sharding scheme
Sharding graph databases is still an active research topic, below I've included a couple of papers and a Stack Overflow discussion on Neo4J sharding:
- Balanced Label Propagation for Partitioning Massive Graphs
- Sharding Social Networks
- Neo4j sharding aspect
In both relational and graph databases, the sharding strategy chosen must reflect the use case being addressed. For relational databases the most general solution involves implementing a "share-nothing" strategy. We split the table that requies sharding into subsets that can be independently manipulated. We should be able to execute all join operations the application requires within a single shard. Sharding is in other words necessarily tied to the requirements of the application.
Graph databases should be approached in a similiar manner to relational databases: avoid sharding if you can. And if sharding is unavoidable, the sharding strategy should be based on the requirements of the application. The most extreme situation arises with scale-free graphs, where a few vertices are much more highly connected than the rest of the vertices in the graph. Even here unless we're doing graph analytics (map-reduce style computations over the whole graph) we should be able to formulate a sharding strategy that works for the given application.
But one of the big draws of graph databases is the promise of efficiently traversing arbitrary paths through the graph. For instance one might be interested in the shortest path between two arbitrary vertices, where the vertices reside in different shards. A different type of problem might occur with graphs so large that all the edges for some of the vertices cannot be maintained on a single machine. This is admittedly far-fetched, given the compute power available today. In these instances we are necessarily executing computations distributed across shards, where network communication might become the limiting factor. There is substantial algorithmic complexity in deciding how to shard such databases. But then such computations are equivalent to executing arbitrary joins on a sharded relational database, which will necessarily spill over the shard boundaries.