GUIDs as Primary Keys or How to Walk Into an Argument

maanantaina, tammikuuta 12, 2009

I generally work on two kinds of projects: the big enterprise web application and the tiny startup web application. The requirements and processes for building these two projects are distinct. For example, the database in a big project could easily be Oracle or DB2 while a tiny or mid-size project may run on MySQL or SQL Server. Other differences include hardware and staffing resources.
Before we get too far, I should make a note about the above databases. I personally have yet to encounter a project that cannot be implemented using MySQL. I have built analytical web applications running hundreds of millions of rows in a single table and it worked and performed well. There are limitations in tuning but generally I have found some level of snobbery with the DBA’s thinking that MySQL is just a toy DB. A word of caution to them is that Toyota was once seen as a mediocre, unreliable, rusty piece of junk and now it is considered one of the most reliable and financially stable auto makers in the world. Things really do change over time. Back to the main subject.
Big projects often have larger budgets, more resources, more requirements and often have a longer development cycle. That longer development cycle may include more time to understand the data model and the production environment and thus produce a executable design.
Small projects often have a small budget, fewer resources and may have much less planning. These projects may also be cash flow constrained, meaning that you want to spend as little per month as possible while still delivering a service that customers will buy.
Databases and primary keys come into play when you are working with a small project that may become big and outgrow its hosting needs. Moving a physical server is relatively easy provided that you can schedule some downtime. You may end up having to replicate the database on a new server and bring up the new server at the new hosting site. There will almost certainly be some data reconciliation process after the transition form the old server to the new.
A bigger problem occurs when you are really small and trying to go from one web provider to another. You may have a Virtual Private Server (VPS) at one hosting provider and have to migrate to a dedicated server with a new provider. This migration process is more difficult and requires more downtime.
Its more difficult in that replication is not an option. Some will argue that database replication is tricky unto itself. However, to me there are only a few things more painful than exporting a database and then importing it somewhere else. The easiest migrations used a GUID as the row ID. This is especially true when I’ve had to merge two or more active databases.
I was planning on writing extensively about the virtues of using a GUID as a primary key and then I read the following posts:
   1. Brian Aker, "Brian "Krow" Aker's Idle Thoughts - Myths, GUID vs Autoincrement,"
   2. Jeff Atwood, "Coding Horror: Primary Keys: IDs versus GUIDs,"
   3. Mike Malone, "Database Design: Choosing a Primary Key - I'm Mike," Blog, I'm Mike,
   4. Jimmy Bogard, "Designing primary keys - Jimmy Bogard -,"
   5. Joe Celko, "Identity Vs. Uniqueidentifier (Newbie question) - microsoft.public.sqlserver.programming | Google Groups,"
   6. Jimmy Nilsson, "The Cost of GUIDs as Primary Keys,"
   7. Kenneth Downs, "The Database Programmer: Database Skills: A Sane Approach To Choosing Primary Keys,"
   8. Peter Zaitsev, "To UUID or not to UUID ? | MySQL Performance Blog,"
Jeff Atwood’s post is particularly interesting because of the huge number of comments and most of them are against the use of a GUID as a primary key.  The most common reason against GUIDs has to do with large tables in that search performance degrades more rapidly than autoincrement or sequence integer IDs. Some of the other posts argue that the degredation is not significant enough to warrant concern.
Others argue that the primary key should be an integer but the end user should see a GUID as that makes it easier to obscure the layout of the database. I’m not sure I buy that argument since it should still be pretty easy to find a given row using that GUID. I think this developer was really trying to create a level of abstraction from what a user may bookmark from something that code could efficiently work with. The strategy would allow the developer to change his primary key without impacting the user. However, it would seem reasonable to ask how much benefit there really is to having both keys when the GUID alone is enough to do all the work?
My thought on this is that the GUID is slow to look up, but once it is loaded you have the object’s real ID and hopefully it will load faster in follow-on queries using that numeric ID. A smart developer may also ask why someone would be loading a record once by GUID and then reloading repeatedly by numeric ID. Shouldn’t the original load be enough for most applications?
Many of the posts have a shared viewpoint that GUIDs do not significantly impact performance on small tables but fail to say how many rows makes a table small. They often agree that DB merges are easier using GUIDs.
My personal viewpoint is that I have not measured a significant enough performance impact to dissuade me from using GUIDs in a project that I know will be migrated to a new hosting provider. I think it is fair to use autoincrement integers or sequences when you know your database will never have to be merged or migrated and on tables with more than 10 million rows though the quality of the server may increase the number of rows.
This opinion takes us to the second part of the time: How to walk into an argument.
Computer science, like all of science, allows developers the freedom to form an opinion and later a hypothesis about how something should work or be done. We have opinions on nearly every subject and some of them are polarizing. A GUID as a primary key is one of those polarizing subjects. 
I have witnessed an explosive argument between developers arguing for and against GUIDs and both of them citing anecdotal evidence only that there is a performance issue. I have personally been involved in a discussion where the other person dismissed the very idea of GUIDs without further discussion. That same project had a significant delay as a result of a data migration problem that would have been entirely avoided had we used GUIDs.
As I said, I was planning to write more extensively on the virtues of GUIDs and then I read the other articles and realized that this is not a back-and-white solution. It is a solution that is ideal in only some situations, just like a hammer is only really useful when driving a nail or banging something into a new shape. It became clear that this subject is highly contested and would become a serious argument.
Walking into a huge discussion or argument over design principles is amazingly easy. One of the hardest things to overcome is that generalizations are often very dangerous without empirical data to back it. Only by taking risks and testing the approach, using real data, can one rule out the use of a particular construct or strategy within a web application. Thus many seemingly unorthodox approaches may actually be valid within the conditions of the application.
My conclusion is simply that GUIDs as a row ID may be perfectly reasonable for your application. It may cause some performance degradation, but the degradation may be insignificant. The only way to know is to test it and compare the performance of the key to a more common numeric key and then balance that against the requirements of your application. The only thing we should not do is dismiss the approach without first confirming that the conceivable issue is a real issue.

You Might Also Like