Los Techies : Blogs about software and anything tech!

Storage Size And Performance Implications Of A GUID PK


I sent the same Guid vs. Int. vs BigInt question to a group of coworkers yesterday. One of the responses I got was from a DBA, and I thought was worth repeating for the world to hear. Keep in mind that these represent the opinions and research of my coworker, not me. Whether I agree with him or not, is not the point of posting his response for the world to see.

---------

Just as with most other things, there is no one correct answer.  I am a proponent of natural keys when they make sense.  Occasionally, we find that the natural key of an relation becomes cumbersome to deal with on a regular basis and we opt to utilize a surrogate key.  Surrogate key usage is a different discussion all together regardless of the data type chosen to represent it.

As far as size and scalability… in SQL Server, an int is 4 Bytes and can hold roughly 2 Billion positive unique numbers, bigint is 8 Bytes and can hold about  9.2 Quintillion (9,223,372,036,854,775,808) positive unique records.

GUIDs are typically 16 Bytes and can store 3.4 X10^38 unique values. 

There are several arguments for the use of  GUID in databases utilized by applications that we write. 

  1. You can generate the pk in code.
  2. You don’t have to execute and additional DB call to retrieve the key, post insert.
  3. Should you need to merge databases between different environments, the keys *should* not collide
    1. This assumes that some other natural key, represented by a unique index, is not violated.
  4. You all can read the rest of the reasons in [the article linked at NHForge].

That having been said, if you were to have a relation with 500 million records. The following would be true.

 

Records

PK Data Type

Data Type Size

Size Units

Size of PK

PK Unit

500,000,000

Integer

4

Bytes

1.863

GB

500,000,000

Big Integer

8

Bytes

3.725

GB

500,000,000

GUID

16

Bytes

7.451

GB

 

You can see that there is a significant different in the amount of space required to store a GUID vs Int or BigInt.  Although disk-space is cheap,  ultimately, that equates to physical contiguous blocks on a hard drive platter (until we all move to SS HDD’s) that have to be physically read or skipped during a seek event.  This can have a negative impact on database seeks that require a table scan.  Once the data is located, the GUID also consumes significantly more memory (RAM) during the retrieval…

If you are doing exact (indexed) fetches via the primary key (OLTP transactions), most of those arguments don’t hold water.  Most of what we do is OLTP.  In an OLAP environment, I strongly recommend the use of natural keys and the smallest data type available to store the data, for the reasons listed above.  When you truly have a relation with 260 million rows and that needs to be joined with a relation containing 500 million rows, an 8-12 byte increase on one of the columns used in the join (on each row) can bring even a hefty db server to its knees.  The likelihood of this happening in one of our OLTP systems is about the same as the likelihood that a GUID will be duplicated. :)

One last thought, db indexes on integer data types are EXTREMELY EFFICIENT, and extremely small.  I honestly don’t know the size difference, but I do know that an index on a GUID is inherently more complex.

Long story short.  Use your brain when you architect these things.  When it makes sense to use a GUID for application architectural reasons, by all means, use them.  However, be aware that the natural keys still exist, and your GUID may need to be done away with in a migration to a large aggregate OLAP reporting system.

---------

Now, if I could only convince him (and the rest of my company) to blog about these things, on their own! :)

Kick It on DotNetKicks.com
Posted Jul 15 2009, 10:00 AM by derick.bailey
Filed under: ,

Comments

Sanjeev Agarwal wrote Daily tech links for .net and related technologies - July 14-17, 2009
on 07-16-2009 3:27 AM

Daily tech links for .net and related technologies - July 14-17, 2009 Web Development ASP.NET Membership

Reflective Perspective - Chris Alcock » The Morning Brew #391 wrote Reflective Perspective - Chris Alcock » The Morning Brew #391
on 07-16-2009 3:33 AM

Pingback from  Reflective Perspective - Chris Alcock  » The Morning Brew #391

Sanjeev Agarwal wrote Daily tech links for .net and related technologies - July 14-17, 2009
on 07-16-2009 4:12 AM

Daily tech links for .net and related technologies - July 14-17, 2009 Web Development ASP.NET Membership

Mikael Egnér wrote re: Storage Size And Performance Implications Of A GUID PK
on 08-04-2009 3:33 AM

I think another relevant problem with GUIDs is in the cache of the database server.

If you use the PK as the clustered index (which is common), then all other indexes on that table will have the GUID value stored as well (for the lookup).

When you add more and more indexes, the size of the GUID in the PK and the indexes will take a lot of space.

That space is not only relevant when reading from disk, but also decides how much "logical" data that could be stored in the cache of SQL-server.

The performance for queries that could use the cache is very much faster than when SQL-server have to read from disk

Our database is near 1 TB of data with very many GUIDs...

colinjack wrote re: Storage Size And Performance Implications Of A GUID PK
on 08-14-2009 3:56 AM

Really good stuff.

Copyright Los Techies 2008, 2009. All rights reserved.
Powered by Community Server (Commercial Edition), by Telligent Systems