varchar as primary key - Page 3
DBAsupport.com Forums - Powered by vBulletin
Page 3 of 4 FirstFirst 1234 LastLast
Results 21 to 30 of 32

Thread: varchar as primary key

  1. #21
    Join Date
    Aug 2002
    Location
    Colorado Springs
    Posts
    5,253
    They certainly tend to generate a lot of hot air.
    David Aldridge,
    "The Oracle Sponge"

    Senior Manager, Business Intelligence Development
    XM Satellite Radio
    Washington, DC

    Oracle ACE

  2. #22
    Join Date
    May 2000
    Location
    ATLANTA, GA, USA
    Posts
    3,135
    I have create 2 tables T1 and T2.
    Table T1

    Name Null? Type
    --------------------- -------- -------------------
    ROW_NUM NOT NULL NUMBER
    OWNER VARCHAR2(30)
    OBJECT_NAME VARCHAR2(128)
    SUBOBJECT_NAME VARCHAR2(30)
    OBJECT_ID NUMBER
    DATA_OBJECT_ID NUMBER
    OBJECT_TYPE VARCHAR2(18)
    CREATED DATE
    LAST_DDL_TIME DATE
    TIMESTAMP VARCHAR2(19)
    STATUS VARCHAR2(7)
    TEMPORARY VARCHAR2(1)
    GENERATED VARCHAR2(1)
    SECONDARY VARCHAR2(1)

    Table T2
    Name Null? Type
    --------------------- -------- -------------------
    ROW_NUM NOT NULL VARCHAR2(10)
    OWNER VARCHAR2(30)
    OBJECT_NAME VARCHAR2(128)
    SUBOBJECT_NAME VARCHAR2(30)
    OBJECT_ID NUMBER
    DATA_OBJECT_ID NUMBER
    OBJECT_TYPE VARCHAR2(18)
    CREATED DATE
    LAST_DDL_TIME DATE
    TIMESTAMP VARCHAR2(19)
    STATUS VARCHAR2(7)
    TEMPORARY VARCHAR2(1)
    GENERATED VARCHAR2(1)
    SECONDARY VARCHAR2(1)

    I have PKs on ROW_NUM columns on both tables.

    select count(1) from t1;
    COUNT(1)
    ----------
    1890288

    14:20:10 H8DEVW3>select count(1) from t2;

    COUNT(1)
    ----------
    1890288

    Both tables have same number of rows (1.8 Million) and same data.

    Here is my test results:

    14:10:23 H8DEVW3>update t1 set row_num = 10 where row_num = 40000;
    update t1 set row_num = 10 where row_num = 40000
    *
    ERROR at line 1:
    ORA-00001: unique constraint (SYSADM.T1_PK) violated

    Elapsed: 00:00:00.80

    14:11:10 H8DEVW3>update t2 set row_num ='10' where row_num ='40000';
    update t2 set row_num ='10' where row_num ='40000'
    *
    ERROR at line 1:
    ORA-00001: unique constraint (SYSADM.T2_PK) violated

    Elapsed: 00:00:00.80

    In my test I found no time difference for searching a unique key in T1_PK and T2_PK.

    However, the space required for T1_PK Index is less than that of T2_PK index.
    BLOCKS NAME LF_ROWS LF_BLKS BR_BLKS BR_ROWS BTREE_SPACE USED_SPACE
    ---------- --------- ---------- ---------- ---------- ---------- ----------- ----------
    4480 T1_PK 1890288 4070 7 4069 32599916 29262117

    BLOCKS NAME LF_ROWS LF_BLKS BR_BLKS BR_ROWS BTREE_SPACE USED_SPACE
    ---------- --------- ---------- ---------- ---------- ---------- ----------- ----------
    5120 T2_PK 1890288 4588 9 4587 36757900 32975669

    T2_PK index needed an extra 518 (4588 - 4070) blocks for the leaf and an additional 2 blocks for the branch because of the varchar2 data type.

    Certainly number data type requires less space than the varchar2 data type. Any serach on index requires minimum of 3 Logical I/Os. That is why I got the same elapsed time from both indexes. If I start searching 1000s rows using index, i will get different response time.

    But if you are going to search a single row or less than 100 rows using index, then either data type will work efficiently. Please remember that defining a data type for a column is entirely dependant on application.

  3. #23
    Join Date
    Aug 2002
    Location
    Colorado Springs
    Posts
    5,253
    just to return to my original point lest it be forgotten, basing the choice of what data type should be used for a PK on any alleged performance benefits of one type over another is not a good way to spend your valuable work hours. You are chasing performance benefits of fractions of a percent for your application, and you risk degrading the flexibility/supportabilty/expandability of your data model.

    i) use a data type that is appropriate for the data being stored.
    ii) if your pk is synthetic, and thus probably numeric, then NUMBER is evidently the correct choice.
    David Aldridge,
    "The Oracle Sponge"

    Senior Manager, Business Intelligence Development
    XM Satellite Radio
    Washington, DC

    Oracle ACE

  4. #24
    Join Date
    Dec 2000
    Location
    Ljubljana, Slovenia
    Posts
    4,439
    First of all, I don't want to object any of tamilselvan's observations because they are all justified and logical. However I do find the following to deserve some additional explanation:
    Originally posted by tamilselvan
    Certainly number data type requires less space than the varchar2 data type. Any serach on index requires minimum of 3 Logical I/Os. That is why I got the same elapsed time from both indexes. If I start searching 1000s rows using index, i will get different response time.
    The above is true only under the pressumption that in VARCHAR2 column you are actualy storing the numeric values (converted to strings) - in that case of course the space consumed for the same numeric values will be greater in VARCHAR2 column than if they are stored in NUMBER column. Of course noone in his right mind would want to store let's say ******* values in a VARCHAR column. However you can also think of it another way:

    If we want to assign unique ******* values to those 1.8 milions, those ******* numbers will on avarage consume somwhere between 4 and 5 bytes each - so those numbers will require about 8 milion bytes of storage (that's just my estimate, don't take it literaly).

    Now let's say that we assign unique VARCHAR2 values to those 1.8 million of records, where values can be composed of normal ALPHANUMERICAL characters (A-Z, a-z, 0-9, if we set asside those other more "exotic" characters) - that means that each of those 1,8 unique values will consume no more than 3 bytes, so in total about 5 milions bytes (with single byte chatacterset).

    So in effect, for uniquely identifying 1.8 million of rows with ******* numeric values will require about 30% more space compared to if we use "normal" alphanumeric values! That's quite normal, because in effect in the first case we use base-10 numeric system, while in the second case we are using base-64 (or something similar) numeric system, which is much more efficient storage-wise. Of course it is totaly different isue which of the two optons have more practical use .....

    So my point is: Saying that using numeric datatype for PK is more efficient compare dto VARCHAR2 datatype because of the sawing in space consumed by the underlying index is only true *under certain circumstances* - in general it is just the other way arround!

    I think slimdave has realy given the most accurate bottomline of this whole isue:
    Originally posted by slimdave
    i) use a data type that is appropriate for the data being stored.
    ii) if your pk is synthetic, and thus probably numeric, then NUMBER is evidently the correct choice.
    Everything else is just speculations....
    Jurij Modic
    ASCII a stupid question, get a stupid ANSI
    24 hours in a day .... 24 beer in a case .... coincidence?

  5. #25
    Join Date
    Nov 2002
    Location
    Geneva Switzerland
    Posts
    3,142
    Originally posted by slimdave
    i) use a data type that is appropriate for the data being stored.
    ii) if your pk is synthetic, and thus probably numeric, then NUMBER is evidently the correct choice.
    AMEN!
    "The power of instruction is seldom of much efficacy except in those happy dispositions where it is almost superfluous" - Gibbon, quoted by R.P.Feynman

  6. #26
    Join Date
    Dec 2002
    Location
    Bangalore ( India )
    Posts
    2,434
    Originally posted by jmodic
    You are talking nonsence here.
    Wrong conclusions due to previously explained misunderstandings. The time difference even on a big table would be negligible, I can assure you.
    The time difference wud be or order of milliseconds...PS below demo in my test environment.

    Pin column of reporting_PRODUCT_Hier is Varchar type.
    Pin column of temp_reporting_PRODUCT_Hier is number type.

    Code:
    US18>select count(*) from reporting_PRODUCT_Hier;
    
      COUNT(*)
    ----------
        421051
    
    Elapsed: 00:00:00.34
    
    Execution Plan
    ----------------------------------------------------------
       0      SELECT STATEMENT Optimizer=CHOOSE (Cost=130 Card=1)
       1    0   SORT (AGGREGATE)
       2    1     INDEX (FAST FULL SCAN) OF 'PK_REPORTING_PRODUCT_HIER' (U
              NIQUE) (Cost=130 Card=421043)
    
    
    
    
    US18>select count(*) from temp_reporting_PRODUCT_Hier;
    
      COUNT(*)
    ----------
        421047
    
    Elapsed: 00:00:00.18
    
    Execution Plan
    ----------------------------------------------------------
       0      SELECT STATEMENT Optimizer=CHOOSE (Cost=99 Card=1)
       1    0   SORT (AGGREGATE)
       2    1     INDEX (FAST FULL SCAN) OF 'PK_TEMP_REPORTING_PRODUCT_HIE
              R' (UNIQUE) (Cost=99 Card=421047)
    
    
    
    
    US18>insert into reporting_PRODUCT_Hier(pin,fordate) values('023252','20-Mar-03');
    
    1 row created.
    
    Elapsed: 00:00:00.94
    
    Execution Plan
    ----------------------------------------------------------
       0      INSERT STATEMENT Optimizer=CHOOSE (Cost=2604 Card=421043 Byt
              es=133891674)
    
    
    US18>insert into temp_reporting_PRODUCT_Hier(pin,fordate) values(023252,'20-Mar-03');
    
    1 row created.
    
    Elapsed: 00:00:00.15
    
    Execution Plan
    ----------------------------------------------------------
       0      INSERT STATEMENT Optimizer=CHOOSE (Cost=2568 Card=421047 Byt
              es=131787711)
    
    
    US18>insert into reporting_PRODUCT_Hier(pin,fordate) values('akjshj1','20-Mar-03');
    
    1 row created.
    
    Elapsed: 00:00:00.31
    
    Execution Plan
    ----------------------------------------------------------
       0      INSERT STATEMENT Optimizer=CHOOSE (Cost=2604 Card=421043 Byt
              es=133891674)
    
    
    US18>insert into temp_reporting_PRODUCT_Hier(pin,fordate) values(9827341,'20-Mar-03');
    
    1 row created.
    
    Elapsed: 00:00:00.16
    
    Execution Plan
    ----------------------------------------------------------
       0      INSERT STATEMENT Optimizer=CHOOSE (Cost=2568 Card=421047 Byt
              es=131787711)
    If u just see above eg u wud see that a worst case diff wud be 150ms for one insert of record.

    Now say u had to insert 100 records...takeing worst case wud be around 15 secs more that wat u require for number datatype..

    now for 1000 it wud require 150 secs more time...

    for 10000 records it wud be 1500 secs more time & this wud be unacceptable.

    and more ever wat i stated in my first post was
    Performance might slightly get affected during insert.

    Abhay.


    __________________
    funky...
    and indeed it will.

    and 1 more thing its all application dependent whether to use varchar as PK or not...i didnt say u shudnt use...but S performance wud slightly be affected.

    Abhay.
    funky...

    "I Dont Want To Follow A Path, I would Rather Go Where There Is No Path And Leave A Trail."

    "Ego is the worst thing many have, try to overcome it & you will be the best, if not good, person on this earth"

  7. #27
    Join Date
    Aug 2002
    Location
    Colorado Springs
    Posts
    5,253
    Still "measuring" CPU time using elapsed time eh?



    You still haven't proved anything about performance. Give us a proper measurement and then we'll talk again.
    David Aldridge,
    "The Oracle Sponge"

    Senior Manager, Business Intelligence Development
    XM Satellite Radio
    Washington, DC

    Oracle ACE

  8. #28
    Join Date
    Nov 2000
    Location
    Pittsburgh, PA
    Posts
    4,015
    There is another thing to consider. Using surrogate keys could cause you application to need more joins Thus negating you 15 millisecond performance gain. I think that the real answer depends on what type of data you are trying to store. Whether or not to use varchar2's or numbers depends on a number of factors. There are situations where Varchar2's would be preferable and there are situations where they would not be preferable. One other consideration is that if you are using a true surrogate key then the value of the key is probably derived from a sequence. Fetching the sequence might take a few milliseconds thus throwing off your numbers.

  9. #29
    Join Date
    Nov 2000
    Location
    Baltimore, MD USA
    Posts
    1,339
    Hi all,

    I know I haven't been around for a while, and I completely confess to not having the patience to read this entire thread. However, I simply canot believe that this argument is still going on. Okay, yes, if someone actually goes through the trouble of making some non-sensical series of letters and characters to make a unique key, then okay, maybe storing a varchar is more efficient. HOWEVER, this is very rarely the case and you all know it. People usually use string PKs because they have some existing string value that they want to use, like Department Name. Now we should all be able to agree that such a value will definitely not be more efficient than a surrogate, numeric PK. And efficiency of space is far more important than slimdave appears to realize. This is because the size of the data has a direct effect on the efficiency of the buffer cache. The buffer cache is one of the most important pieces of performance, what with disk accesses being orders of magnitude slower than memory accesses. Further, there is a geometric curve to cache utilization. If your data is 3 times larger than it should be, this means that only 1/3 of what could fit in the cache actualy does. This means that you are 3 times more likely to need to read from the disk. Since the data is 3 times larger, it will take you 3 times as long to read it. So your database now does 9 times more disk accesses than it should have.

    Further, keys are even more important than other fields. Let's look at Department Name. In an accounting application, for example, just how many tables will contain the FK of the Department Name? Lots. Further, if that were a dynamic table, all of these FKs might have indexes on them (to avoid the idiotic locking issue, if nothing else). So now, we have this very large field duplicated in many tables, and further duplicated in many indexes. This single, bad PK choice is now costing a whole lot of cache and thus a whole lot of performance.

    So, yes, the difference specifically between a string and a numeric value may be small, given similar sizes, and a string might actually be able to be tighter, but first of all, who actually does that? What most novices will take away from this argument is that it doesn't matter if I use large strings as PKs - jmodic and slimdave said that was fine. Large PKs need to be avoided at all costs!

    Now, even when people *do* attempt to use 'tight' string PKs, they often run into another problem. That of meaningful keys, which I have also vented against on many occasions here. For Department names, people might try to make some cute, 3-letter abbreviations for each department. Of course, when this company merges with another one that is twice as big, that little abbreviation scheme goes out the window. And on and on. There are many benefits to having meaningless keys, including performance benefits, and there are many maintenance, concurrency, and other problems with using meaningful keys. I can elaborate if need be.

    There are also many problems with multiple-field keys, which is another side-effect of using string keys. People use string keys because they have them and they mean something and they're easy. They then get propogated to other tables. When a cross-reference table is created, it gets the keys from the 2 original tables because they have meaning... When a child table is created, it gets the key from the parent and another field, because the person is thinking that way. When what they *should* be doing is to create a surrogate, single field, numeric, sequence-populated PK for each table. While this may not be the best solution in 100% of the cases, it will be a better solution than what is actually done 95% of the time!

    This solution, in general, leads to tighter, faster keys, tighter indexes, better cache utilization, a smaller, faster database with less concurrency and maintenance headaches down the line (which are caused by meaningful and multi-field keys).

    So I do agree with slimdave on one point - this is not something that you should waste cycles worrying about - follow this simple rule to a better database. There are always special-case exceptions, but the practice, IMHO, for each case is to assume that a surrogate key is better until proven otherwise.


    - Chris
    Christopher R. Long
    ChrisRLong@HotMail.Com
    But that's just my opinion. I could be wrong

  10. #30
    Join Date
    Aug 2002
    Location
    Colorado Springs
    Posts
    5,253
    I got into this because of some ridiculous pseudo-measurements that attempted to prove the benefits of numerics over varchars using bogus methodologies, but don't take my defence of varchars as being a dismissal of synthetic numeric keys.

    if someone actually goes through the trouble of making some non-sensical series of letters and characters to make a unique key, then okay, maybe storing a varchar is more efficient. HOWEVER, this is very rarely the case and you all know it.
    Why would anyone do this? No-one has suggested doing it - it would be a dumb idea.

    efficiency of space is far more important than slimdave appears to realize.
    Haven't even mentioned field size, but you seem to be assuming that varchar2's must be lengthy fields -- could be a varchar2(2) though couldn't it? Long varchars are surely a bad idea for key fields, so that's a good point.

    What most novices will take away from this argument is that it doesn't matter if I use large strings as PKs - jmodic and slimdave said that was fine.
    I doubt it -- jmodic and me never said any such thing, so i don't see how they could. Again, you're assuming that varchar2's are lengthy fields.

    There are many benefits to having meaningless keys, including performance benefits, and there are many maintenance, concurrency, and other problems with using meaningful keys. I can elaborate if need be.
    No-one's arguing about the benefits of synthetic keys, they are generaly the best choice, but there are cases where a synthetic key is neither required nor beneficial - how about US State abbreviations, or international currency abbreviations (USD,GBP etc)? Nothing wrong with using those as varchar2's.

    People use string keys because they have them and they mean something and they're easy. They then get propogated to other tables. When a cross-reference table is created, it gets the keys from the 2 original tables because they have meaning... When a child table is created, it gets the key from the parent and another field, because the person is thinking that way. When what they *should* be doing is to create a surrogate, single field, numeric, sequence-populated PK for each table. While this may not be the best solution in 100% of the cases, it will be a better solution than what is actually done 95% of the time!
    Sure, use a synthetic numeric key for such cases. It is the appropriate choice.

    So I do agree with slimdave on one point - this is not something that you should waste cycles worrying about - follow this simple rule to a better database. There are always special-case exceptions, but the practice, IMHO, for each case is to assume that a surrogate key is better until proven otherwise.
    Yes, synthetic keys are a reasonable choice in 99% of cases. Let's not just dismiss the other 1% of special cases.

    Anyway, this lengthy discussion has tuned out pretty well, i tink, as we've covered a lot of diverse considerations.
    David Aldridge,
    "The Oracle Sponge"

    Senior Manager, Business Intelligence Development
    XM Satellite Radio
    Washington, DC

    Oracle ACE

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width