varchar as primary key

**chrisrlong** · 03-26-2003, 11:55 AM

Hi all,

I know I haven't been around for a while, and I completely confess to not having the patience to read this entire thread. However, I simply canot believe that this argument is still going on. Okay, yes, if someone actually goes through the trouble of making some non-sensical series of letters and characters to make a unique key, then okay, maybe storing a varchar is more efficient. HOWEVER, this is very rarely the case and you all know it. People usually use string PKs because they have some existing string value that they want to use, like Department Name. Now we should all be able to agree that such a value will definitely not be more efficient than a surrogate, numeric PK. And efficiency of space is far more important than slimdave appears to realize. This is because the size of the data has a direct effect on the efficiency of the buffer cache. The buffer cache is one of the most important pieces of performance, what with disk accesses being orders of magnitude slower than memory accesses. Further, there is a geometric curve to cache utilization. If your data is 3 times larger than it should be, this means that only 1/3 of what could fit in the cache actualy does. This means that you are 3 times more likely to need to read from the disk. Since the data is 3 times larger, it will take you 3 times as long to read it. So your database now does 9 times more disk accesses than it should have.

Further, keys are even more important than other fields. Let's look at Department Name. In an accounting application, for example, just how many tables will contain the FK of the Department Name? Lots. Further, if that were a dynamic table, all of these FKs might have indexes on them (to avoid the idiotic locking issue, if nothing else). So now, we have this very large field duplicated in many tables, and further duplicated in many indexes. This single, bad PK choice is now costing a whole lot of cache and thus a whole lot of performance.

So, yes, the difference specifically between a string and a numeric value may be small, given similar sizes, and a string might actually be able to be tighter, but first of all, who actually does that? What most novices will take away from this argument is that it doesn't matter if I use large strings as PKs - jmodic and slimdave said that was fine. Large PKs need to be avoided at all costs!

Now, even when people *do* attempt to use 'tight' string PKs, they often run into another problem. That of meaningful keys, which I have also vented against on many occasions here. For Department names, people might try to make some cute, 3-letter abbreviations for each department. Of course, when this company merges with another one that is twice as big, that little abbreviation scheme goes out the window. And on and on. There are many benefits to having meaningless keys, including performance benefits, and there are many maintenance, concurrency, and other problems with using meaningful keys. I can elaborate if need be.

There are also many problems with multiple-field keys, which is another side-effect of using string keys. People use string keys because they have them and they mean something and they're easy. They then get propogated to other tables. When a cross-reference table is created, it gets the keys from the 2 original tables because they have meaning... When a child table is created, it gets the key from the parent and another field, because the person is thinking that way. When what they *should* be doing is to create a surrogate, single field, numeric, sequence-populated PK for each table. While this may not be the best solution in 100% of the cases, it will be a better solution than what is actually done 95% of the time!

This solution, in general, leads to tighter, faster keys, tighter indexes, better cache utilization, a smaller, faster database with less concurrency and maintenance headaches down the line (which are caused by meaningful and multi-field keys).

So I do agree with slimdave on one point - this is not something that you should waste cycles worrying about - follow this simple rule to a better database. There are always special-case exceptions, but the practice, IMHO, for each case is to assume that a surrogate key is better until proven otherwise.

- Chris

Thread: varchar as primary key

Thread Tools

Display

Threaded View

Posting Permissions