Natural vs Synthetic Keys
I recently post a thread regarding how about load data to Fact tables using surrogate keys. How have you done that in the past? In my opinion for example surrogate keys made things more complex because to load the fact table data I have to obtain the surrogate keys from the time dimension.
The complexity of the job depends a lot on what tool you're using, I guess. The mainstream tools such as Informatica usually provide "wizards" for this sort of thing, probably with varying degrees of success. So you can run a wizard to maintain a dimension of a particular slowly-changing type and it will do most of the work for you.
Originally Posted by pando
It's the same old story though -- DWh's are a different kettle of fish. For example if your source system sends you synthetic keys of its own, are those in fact "natural" by some definitions because they were generated external to the data warehouse?
The decision support/DWh system that I currently work on does in fact make almost exclusive use of natural keys -- they were the choice of the original designer, and it's an unfortunate reality that in order to change just about anything in our system (add new column, change general ledger account tables) we have to recreate the entire stinkin' fact tables anyway. It's not mission essential so we can get away with doing these things "asynchronously" (by migrating static old data first and more dynamic later data last) with a little downtime thrown in without causing users any problems. The major reason for using them though is that they were so deeply embedded in the way that the system was initially designed that it seems contrary to change now.
Date warehouses: who'd have 'em, eh? How are you enjoying the DWh experience, by the way?
Letīs use the time dimension example, if I use surrogate keys for time dimension I would have (for example) 100 for 20050601, 110 for 20050611 etc. If I wanna load 10 million rows with mixed dates (20050601, 20050523.. so on) into the fact table I would have to query the dimension to get the PK and load it to the fact table, isnt this a hassle?
This has always been a bit of a philisophical conundrum for me, because I always think, "Yeah, dates are OK. No need for a synthetic there" but have not put my finger on the reason why.
Well, maybe, let me float a theory past you guys and see what you think, 'cos I'm not sure if this is a flash of insight or just sophistry -- it could go either way.
Start with two points:
- The difference between a natural key and a synthetic key is that a natural key has some meaning to the naked eye in isolation, and thus can be exposed directly to the user and "generated" by users. A synthetic key's meaning is only revealed through translation to a natural value.
- If you change the value of a natural key, you're genera'ly not changing what that key value represents, eg. "Finland", you are just changing it's means of representation -- ie. some visible code value.
Now what is a date value in Oracle? It's some internal representation -- a string of digits I suppose. Let me float the idea that a date column is actually a synthetic key, but appears to be natural because Oracle uses one of a great many possible conversions to represent it as a naturalkey. So if your boss came in one morning and said, "Bad news lads, the business is changing the standard date format to be Julian", or you wanted to suddenly start representing month names in ... um ... Urdu, then just as in the case where a country changes its code we're not saying that the date itself has changed, just the "human-friendly" representation of it. You do this pretty simply through changing NLS settings -- internally if you have quality code that always uses a date format picture to say such things as
then you don't even have to change that.
Where date_of_transaction < To_Date('01-Jan-2005','DD-Mon-YYYY')
So here's my proposition: date-based keys are really synthetic, and that's why they appear to be an exception (and that's why representing them with a regular sequence-generated synthetic value makes no sense) -- it's the representation of them, with year-month-day or whatever, that is natural, and in effect Oracle is maintaining an internal and very flexible two-way translation table for you, accessed through the to_date() and to_char() functions (not literally a table of course, but with the same purpose).
Feedback, feedback! What do you think.
hmmm what it bothers me is how do you load Dimension surrogate keys in the fact table in a DWH!
Technology dependent, I suppose. With a SQL/PL/SQL approach I guess you'd "cleanse" your data in a temporary intermediate table between the source and the fact table, then check for any key values in it that do not exist in the dimensions, generating new synthetic keys as you find them, then you'd join this temporary table to the dimensions (hash joins all the way, I'd expect) as you insert from the temporary table into the fact table (or the table with which you will exchange a partition of the fact table). NOLOGGING would be used where you're moving around that fact table data to reduce probs there.
Maybe OWB works differently to that -- Informatica certainly does. In Informatica the "approved" method would be to stream the data through a series of lookup transformations that would both lookup existing dimension values and would identify cases where a new row is required etc.. It's not too painful -- more work than natural keys of course.
Why this thread is in "Obfuscation Unlimited" ? OU Mods might consider it to move to Admin Forum.
Oracle Certified Professional 8i, 9i.
"The degree of normality in a database is inversely proportional to that of its DBA"
It started off by referring to external discussions and blogs, so was initially OU.
Originally Posted by SANJAY_G
I remember when this place was cool.
Click Here to Expand Forum to Full Width