Joe Celko on IDENTITY
DBAsupport.com Forums - Powered by vBulletin
Results 1 to 9 of 9

Thread: Joe Celko on IDENTITY

  1. #1
    Join Date
    Nov 2000
    Location
    Baltimore, MD USA
    Posts
    1,339
    Thought this would be interesting for this forum to discuss. While he is addressing SQLServer IDENTITY fields, you can easily replace IDENTITY with Trigger-Sequence-populated PK field. The crux of the argument lies in point 5, where he states:
    "If you have designed your tables correctly, they will have a
    meaningful primary key derived from the nature of the entity they model. "

    For the record: I COMPLETELY AND TOTALLY DISAGREE WITH THIS STATEMENT. I cannot emphasize enough how whole-heartedly I support and promote the use of surrogate keys for all PKs!

    However, I would like to hear what everyone else thinks about this:



    >> How is the use of an IDENTITY column as a primary key not relational?
    I'm
    under the impression it is. <<

    Real SQL programmers use real keys and do not try to imitate 1950's
    magnetic
    tape or punch card systems in a relational model. In the early days of
    programming languages, we exposed a lot of the physical and hardware
    implementation to the programmer. For example, COBOL and FORTRAN depend
    on
    physically contiguous storage of data, which made a ROWID (i.e. physical
    location reference) possible.

    Later, we designed languages to independent of any physical and hardware
    implementation. Thus, an INTEGER datatype did not have to be eight
    binary
    bits in a two complement format. It just had to behave like an integer
    in
    the program and the programmer did not have to worry about how the
    hardware
    did its work -- or even know what the hardware was.

    SQL and other modern programming languages carry this idea further and
    try
    to completely separate logical and physical implementations. The idea
    is
    that SQL is based on sets (which have no ordering) and a standard that
    defines its logical behavior. The behavior has nothing to do with
    whether a
    product uses B-Tree indexes, bit vectors or hashing; two complement or
    base
    ten arithmetic; whether the host program calling the SQL is C or Cobol;
    etc.

    The IDENTITY column in SQL Server is an attempt to return to those
    thrilling
    days of yesteryear, and the conceptual junk left over from the days when
    people did not know much about relational databases. What we knew was
    sequential file systems -- punch cards and magnetic tapes. Even the
    disk
    file systems mimicked these systems, adding only simple indexes.

    Sequence was a basic way of looking at data. People thought in terms of
    them at a primitive level. A sequence of bits make a byte, a sequence
    of
    bytes make a field, a sequence of fields make a record and sequence of
    records make a file. Very low level, very close to the machinery.

    Then along comes the relational model. It is based on sets; a set is a
    completed whole, without any ordering to it. No sequences! Very
    abstract!
    Programmers did not know how to cope, so the vendors exposed the
    physical
    implementation and called these things "features" and locked their
    products
    to particular architectures. I can go into details on that problem, but
    let
    me say that when we went to the bar after ANSI X3H2 meetings, the
    vendors
    griped about what they had to do to these extensions to preserve them in
    the
    next platform, how they could not scale up to data warehouse size
    databases,
    etc.

    The IDENTITY column is one of these mistakes.

    1) It is not part of the SQL-92 Standard and it is highly proprietary to
    the
    Sybase family. It is not portable -- not quite the same thing as
    proprietary, since you can often translate one SQL dialect into another
    with
    a simple replacement (i.e. the % operator becomes the MOD() function).
    So
    your code will not move over to a new database.

    2) IDENTITY looks like a datatype, but it is not. Create a table with
    one
    column in it and make it an IDENTITY column. Insert a number into the
    table
    and see what happens. Try to set it to NULL. If you cannot insert,
    update
    and delete all the columns, then this is not a table!

    3) IDENTITY looks like a constraint, but it is not. Try to create a
    table
    with two IDENTITY columns and it fails. If you cannot add it to a
    column,
    then it is not a constraint. It is possible to write a a set of
    constraints
    that prohibit data from ever being put in the table (their predicate is
    always FALSE). It is possible to write a a set of constraints that
    allow
    anything in the table (their predicate is always TRUE). But no
    constraint
    can prohibit the creation of the table itself -- that is a
    meta-constraint.

    4) It is not relational. Consider this statement on a table, Foo, which
    has
    an identity column. Assume the query returns more than one row.

    INSERT INTO Foo (x)
    SELECT a FROM Bar;

    You will get a result like this:

    IDENTITY X
    ============
    1 'a'
    2 'b'
    3 'c'

    but if the query changed an index or was put on the physical disk data
    page
    differently, you might have gotten:

    IDENTITY X
    ============
    1 'b'
    2 'c'
    3 'a'

    Explain why one result is the logically correct choice for an identifier
    and
    all other choices are not, without any reference to the physical
    implementation. You cannot.

    Instead of treating the query as a set, you are doing 1950's sequential
    processing using the underlying sequential file system the Sybase family
    started with.

    5) If you have designed your tables correctly, they will have a
    meaningful
    primary key derived from the nature of the entity they model. The
    IDENTITY
    column should be a redundant key. The reason IDENTITY columns are
    popular
    as keys is that they are easy to declare. This is also the same reason
    that
    people build non-normalized databases and put pennies in fuse boxes --
    easy
    is not right.

    6) It is a bitch to do calculations on IDENTITY column values. Well, it
    was
    hard to do direct math on the sequential position of a record in a
    1950's
    punch card system and that it what the IDENTITY is mimicking.

    7) There is no check digit in an IDENTITY columns value, so you have no
    way
    of verifying it if you use it as a key.

    8) If you use IDENTITY as a key, the values tend to cluster on physical
    data
    pages because they are sequential. The result is that if the most
    recent
    rows are the most likely to be accessed, there will be locking
    contention
    for control of those physical data pages. What you really wanted in a
    key
    is some spread of the rows over physical storage to avoid having every
    user
    trying to get to the same page at the same time.

    9) The actual implementation of the IDENTITY column has been problematic
    since Version 7.0. You can look up threads in the news groups to get
    assorted tales of woe.

    --CELKO--

    SQL guru at Trilogy
    ===========================




    - Chris

  2. #2
    Join Date
    Dec 2000
    Location
    Ljubljana, Slovenia
    Posts
    4,439
    Originally posted by chrisrlong
    For the record: I COMPLETELY AND TOTALLY DISAGREE WITH THIS STATEMENT. I cannot emphasize enough how whole-heartedly I support and promote the use of surrogate keys for all PKs!

    However, I would like to hear what everyone else thinks about this:
    ..SNIP...
    Not going into details about IDENTIRY in SQL Server as I don't have any exeriences with it, I strongly disagree with Celko (and agree with you) about the use of surrogat keys for PKs. Particulary in DW environment with dimension models (star- and snowflake designs), surrogat keys are a must. Ralph Kimbal as one of the most prominent expert for DW concepts doesn't even want to discuss anything other than meaningless numbers (surrogate keys) for PKs.
    Jurij Modic
    ASCII a stupid question, get a stupid ANSI
    24 hours in a day .... 24 beer in a case .... coincidence?

  3. #3
    Join Date
    Nov 2000
    Location
    greenwich.ct.us
    Posts
    9,092
    Originally posted by chrisrlong
    Thought this would be interesting for this forum to discuss. While he is addressing SQLServer IDENTITY fields, you can easily replace IDENTITY with Trigger-Sequence-populated PK field. The crux of the argument lies in point 5, where he states:
    "If you have designed your tables correctly, they will have a
    meaningful primary key derived from the nature of the entity they model. "

    For the record: I COMPLETELY AND TOTALLY DISAGREE WITH THIS STATEMENT. I cannot emphasize enough how whole-heartedly I support and promote the use of surrogate keys for all PKs!
    Wow, that's a pretty strong statement. I agree that there are certain situations where a surrogate PK is necessary. However, there are other situations in which the use of a surrogate PK will hamper the physical implementation. For example, say you have a table that contains historical and is structured as:
    trade_dt date,
    exchange_cd varchar2(10),
    symbol_cd varchar2(10),
    price number(20,10)

    The natural PK for this table would be trade_dt, exchange_cd, and symbol_cd.

    As the DBA I would want to option to partition this table based on the date range. That way, I could move my "active" data to a faster tablespace and my "archive" data to a slower tablespace.

    However, if you add a surrogate PK to the table and structure it as:
    surrogate_id number(10),
    trade_dt date,
    exchange_cd varchar2(10),
    symbol_cd varchar2(10),
    price number(20,10)

    partitioning based on date range becomes difficult. If you want to use local partitions, you have to estimate how many ID number will fall in the particular time period you are interested in partitioning.

    You could continue to use a regular index and partition on date, but when you move a partition from one tablespace to another the index becomes unusable and you have to rebuild it.

    I think in this situation, using an artificial PK hampers the design.

    However, I would like to hear what everyone else thinks about this:



    >> How is the use of an IDENTITY column as a primary key not relational?
    I'm
    under the impression it is. <<
    Isn't one of the rules of data normalization that each attribute must contribute to the relation? (or something like that). I think his take is that since a child record doesn't hold meaning without it's parent record, why does it need an ID number? It's just extra "stuff" along for the ride.

    That being said, I think it's perfectly reasonable to use sequential numbers for things like ID numbers, time-ordered events, etc.
    Jeff Hunter
    marist89@yahoo.com
    http://marist89.blogspot.com/
    Get Firefox!
    "I pledge to stop eating sharks fin soup and will not do so under any circumstances."

  4. #4
    Join Date
    Aug 2000
    Posts
    194
    "Isn't one of the rules of data normalization that each attribute must contribute to the relation? (or something like that). I think his take is that since a child record doesn't hold meaning without it's parent record, why does it need an ID number"

    I guess, in this case whenever child records are inserted, the IDENTITY (or PK) of the parent table is also carried over, to have the relation maintained. (here the parent key as such is not a part of the child's Primary Key)

    Thus,

    Table1:
    ----------
    Tab1Id1 number, --PK (IDENTITY)
    Tab1Comp1 number,
    Tab1Comp2 varchae2(10),
    Tab1Comp3 varchae2(10),
    Col1 varcha2(10),
    Col2 varcha2(10),
    Constraint UQ_Tab1Com
    Unique (Tab1Comp1, Tab1Comp2, Tab1Comp3)


    Table2:
    ----------
    Tab2Id1 number, --PK (IDENTITY)
    Tab1Id1 number, --FK
    Tab2Col1 varcha2(10),
    Tab2Col2 varcha2(10)


    It does make sense in some cases and may not at some other places as noted by Jeff.


  5. #5
    Join Date
    Nov 2000
    Location
    Baltimore, MD USA
    Posts
    1,339
    Actually, I know of no normalization rule that dis-allows the use of a surrogate primary key, but I'll have to think about that some more, as I am also a major proponent of normalization. I'd hate think I've been contradicting myself .

    And, yes, I guess I was a little over-emphatic in my support or surrogate keys, but not much. There are sooo many benefits to using surrogate keys, that one has to have a very strong argument to convince me not to use one.

    And speaking of such an argument , you may have one. Can you give me some more detail about the locally-partitioned index issue. Before I go out with a book preaching the gospel according to surrogate keys, I'd like to better understand any drawbacks , and I hadn't really picked up on that one - partitioning is one of the things I have not seriously dug into yet. On the surface, however, it sounds like something with a significant-enough fallout that maybe Oracle is planning on changing it in 9i. Much like their lame notion of 'PL/SQL's SQL engine lags behind the base SQL engine' that is being 'fixed' in 9i.

    Anyway, if you could explain that to me a little more or feed me a link, it would be greatly appreciated. Also, if there are any other drawbacks to surrogate keys that you can think of, please let me know.

    Thanks,

    - Chris

  6. #6
    Join Date
    Nov 2000
    Location
    greenwich.ct.us
    Posts
    9,092
    In general, you can look at:
    http://technet.oracle.com/docs/produ...rti.htm#437783

    Local indexes are discussed in:
    http://technet.oracle.com/docs/produ...rtiti.htm#5979

    Basically, when you partition a table by range, your table is conceptually split up into many different tables each containing a certain set of the data. A local index is an index that is partitioned as well using the same partition bounds as the base table. For example, take my table from before:
    trade_dt date,
    exchange_cd varchar2(10),
    symbol_cd varchar2(10),
    price number(20,10)

    Say you wanted this table partitioned by month, you would create 12 partitions. A local index would also have 12 partitions each corresponding to the partition of the base table. (The table partition for JAN would have a corresponding partition for JAN). You essentially have 12 tables and 12 indexes that Oracle knows how to put back together.

    The problem comes in when you don't partition on the PK. Say we take:
    surrogate_id number(10),
    trade_dt date,
    exchange_cd varchar2(10),
    symbol_cd varchar2(10),
    price number(20,10)

    and partition it on trade_dt by month. If the surrogate_id is the PK, it can NOT be a partitioned index because it is not part of the partition key. Therefore, if you modify any of the partitions, Oracle doesn't know how to tell the index what's going on and it flags the index as "UNUSABLE". You then have to go and rebuild the index.
    Jeff Hunter
    marist89@yahoo.com
    http://marist89.blogspot.com/
    Get Firefox!
    "I pledge to stop eating sharks fin soup and will not do so under any circumstances."

  7. #7
    Join Date
    Dec 2000
    Location
    Ljubljana, Slovenia
    Posts
    4,439
    Jeff,

    Your remarks about problems with rate partitioning are valid, but only up to some point. As allways, there are pros and cons about using real meaningfull values for PKs. You have to draw a line under all pluses and minuses and see what to use, all in the context of your bussines rules.

    You say the things get complicated when you want to use date-range partitioning in your example with surrogat PK when you want to move partitions etc. True, but the partition-managing tasks are generally more an exception than a rule, so in those unavoidable exceptional situations DBA should be able to perform a global index rebuild as a part of a partition-managing task.

    On the other hand, what if your bussines requires that sometime trade_dt date value can be changed (clerk who entered it initially made a mistake, he should be able to correct that mistake...)? In the lack of update-cascade functionality it sometimes gets so complicated to implement this. But if you have surrogate PK it is a simple update, nothing else.

    Speaking of date datatype as a part of a primary key, it is sometimes unavoidable to use surrogate values instead of dates in PKs. A second as the lowest date granularity is simply not sufficient in many real life cases to relay on this datatype as a distinctive value for each record. If Oracle had the timestamp datatype it would be different, but then again - isn't the timestamp also a kind of surrogate value?

    Jurij Modic
    ASCII a stupid question, get a stupid ANSI
    24 hours in a day .... 24 beer in a case .... coincidence?

  8. #8
    Join Date
    Nov 2000
    Location
    Baltimore, MD USA
    Posts
    1,339

    Thumbs up

    First, thanks for the input, and I will absolutely look into the issue some more.

    But back to the original topic:

    We have a break-through!!!

    A friend of mine sent a message to Mr. Celko asking him several questions about his stance on IDENTITY columns. After a lot of other feedback (including a massive dissertation on the makup of SSNs), we have this nugget towards the end:

    "A surrogate key is fine with me, but I don't think we mean the same thing by that term. A surrogate is NEVER exposed to the user at all, but is created by the database engine itself for speed -- a hash code for a long key, a pointer based on the current physical storage, etc.

    An artificial key is exposed to the users and it is something that was made up as an identifier for our system. Let's say we have no industry standard customer number in our model, so we want to assign one to everybody we do business with. It will be exposed to the customer on all his paper work with us. So we want a check digit to be sure that we can verify it when it is keyed in. We would like it to be big enough to over
    all possible customers we will ever have. This and a lot of other things are part of data design. "

    Sooo, the universe once again makes sense. My head has been spinning the last couple hours trying to understand how one of my idols could have fallen so far. It seems it was mostly a mis-undertanding. To translate, his idea of a surrogate key is the same as ours, or at least mine - it is *never* displayed to the user! What he has been railing against (besides the IDENTITY implementation) has been what he calls 'artificial' keys, which *are* displayed to the user - his example of Customer ID brings it into perfect relief. I also now understand what he means about 'verifiability' and check digits. And on top of that, I agree with much of what he is saying. It is, indeed, part of the data design process to determine how such artifical keys will be generated, provided and verified.



    All is once again right with the world

    - Chris

    [Edited by chrisrlong on 04-26-2001 at 05:01 PM]

  9. #9
    Join Date
    Apr 2001
    Posts
    219
    I use surrogates and natural keys, I pick them based on many different factors like:

    1. How is the table going to be used (ie parent, child, lookup, etc..)
    2. The extra, possibly uselesss, index created by a surrogate key (Becuase I have a system that inserts over 2 million rows an hour on a weak machine. I do not want the extra overhead of a potentialy useless index)
    3. Possible duplicate entries
    4. Denormalization
    5. and the list goes on

    Here is a page I found on the subject.

    http://www.dorna-consultants.com/consulting.htm

    Here is my two-cents on this subject.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width