DBAsupport.com Forums - Powered by vBulletin
Page 1 of 2 12 LastLast
Results 1 to 10 of 14

Thread: Time Dimension in DW: To the Day or to the Second?

  1. #1
    Join Date
    Jun 2007
    Posts
    23

    Question Time Dimension in DW: To the Day or to the Second?

    OK, DW gurus ... I've gone through all the fun discussions on whether the time dimension in a star schema should have a surrogate key or not - in my experience, the answer has been easy: natural key using the DATE data type - best for partitioning, best for user queries, easy to understand. Not sure why people argue about this, at least in Oracle circles. But, anyway ...

    The question I have on a new warehouse implementation is whether we should have the time dimension store everything down to the second or keep it at the date level while having two fields in the fact table. We ultimately need the fact table to be granular to the second the event happened. So we had two options:
    1. Create a time field in the fact table and a time dimension whose PK is a DATE which has values down to the second. This means we get a rather large time dimension (worst case, 86,400 records per day - gets to be a lot after a few months).

    2. Create two fields in the fact table: a time_stamp (full date and time) and a date_stamp (trunc(time_stamp), so only has the date). Both are DATE data types. Then create a date dimension (instead of time), which hash only full days (so 24 rows per day). This means less rows in the dimension, but twice the storage in fact.

    Assuming the fact will have hundreds of millions of rows per day (and, obviously, several other fields - like 50) with transactions almost every second, which would be the better option here? Or is there a third?

    I was thinking option 2 since we can compress the fact table, but not the dimension. Plus, user queries that use the date_stamp field in fact don't have to worry about issues with time (they can freely use the "between" statement). Plus, the same dimension can be used for aggregated facts (things that will roll up to the full day).

    But maybe option 1 is more standard? And maybe we can partition the dimension for performance gains?

    Thanks in advance.

  2. #2
    Join Date
    Jan 2001
    Posts
    2,828
    Hi

    Are your end users going to analyze and slice the fact values down to seconds ?

    What does this datawarehouse support ? what kind of questions are the analyst likely to ask against this DW



    regards
    Hrishy

  3. #3
    Join Date
    Jun 2007
    Posts
    23
    Hrishy,
    Yes, according to business requirements, there will be times where analysis down to the specific event time will be necessary. It's a clickstream DW, so there will be user behavior analysis, traffic analysis, and marketing reports as well as user path analysis (specific to a session, what path did a user take before exiting or specific to a page, how many users went to that page and then went to page X right afterwards - things like that).

    Unfortunately, because this level of information is so new to them, they're in the mode of keeping the raw level until they have a better feel for what they really need to maintain.

    The current plan is to keep the raw data in a dimensional form and build aggregated fact tables from that. They will then decide on different archiving strategies for the raw vs. aggregated facts. And maybe in a much later phase (a year from now), they'll decide the raw is not necessary because their strategic and operational decisions can all be done from the aggregated facts ... but they're not there yet.

    Hope that gives you enough detail to help answer the questions.

  4. #4
    Join Date
    Jan 2001
    Posts
    2,828
    Hi

    I am not an expert and i maybe wrong but a few questions.

    With option 1 have you considered partitioning the Dimension table itself based on date ?(say weekly or a daily or a monthly partition)

    With option 2 can your users analyze your fact upto seconds ?

    regards
    Hrishy

  5. #5
    Join Date
    Dec 2002
    Location
    Bangalore ( India )
    Posts
    2,434
    when you can achive both partitioning and access to records at seconds level with option 1, why go with option 2?
    funky...

    "I Dont Want To Follow A Path, I would Rather Go Where There Is No Path And Leave A Trail."

    "Ego is the worst thing many have, try to overcome it & you will be the best, if not good, person on this earth"

  6. #6
    Join Date
    Jun 2007
    Posts
    23
    Quote Originally Posted by hrishy
    Hi

    I am not an expert and i maybe wrong but a few questions.

    With option 1 have you considered partitioning the Dimension table itself based on date ?(say weekly or a daily or a monthly partition)

    With option 2 can your users analyze your fact upto seconds ?

    regards
    Hrishy
    Yes, I considered partitioning the dimension (I think I mentioned that in the original post, sorry if not). Still, it's a huge table nonetheless.

    With option 2 users can go down to the second by using the separate time_stamp field (always querying against date_stamp for partition pruning, then using time_stamp for more details).

    Any thoughts?

  7. #7
    Join Date
    Jun 2007
    Posts
    23
    Quote Originally Posted by abhaysk
    when you can achive both partitioning and access to records at seconds level with option 1, why go with option 2?
    Because option 1 gives me a table with 86,400 records per day while option 2 gives me a table with 24 per day. I'm trying to figure out if the storage savings on the fact (7 extra bytes per row) is worth it if it means having a huge dimension table.

  8. #8
    Join Date
    Mar 2007
    Location
    Ft. Lauderdale, FL
    Posts
    3,555
    Quote Originally Posted by paulb
    Because option 1 gives me a table with 86,400 records per day while option 2 gives me a table with 24 per day. I'm trying to figure out if the storage savings on the fact (7 extra bytes per row) is worth it if it means having a huge dimension table.
    Now you got me really confused.

    Your option number one sounds like an aggregation to the second on your DIM table.
    Your option number two sounds like an aggregation to the hour on your DIM table.

    Did I get it right?

    If I got it right I'm really not sure what your DIM table is describing.

    For sure -as per your business specs- you have to have your FACT table at the transaction level.
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.

  9. #9
    Join Date
    Jun 2007
    Posts
    23
    Quote Originally Posted by PAVB
    Now you got me really confused.

    Your option number one sounds like an aggregation to the second on your DIM table.
    Your option number two sounds like an aggregation to the hour on your DIM table.

    Did I get it right?

    If I got it right I'm really not sure what your DIM table is describing.

    For sure -as per your business specs- you have to have your FACT table at the transaction level.
    I'm sorry, I see what I said wrong in my reply - option 2 means 1 row per day, not 24 (sorry, my head went to hourly for some odd reason).

    Let me see if I can help clarify again just in case. In either option, we're talking about a dimension table.

    Option 1: call it the time dimension. Fields include time_stamp (to the second), date_stamp (rollup of time_stamp), day_of_week, day_name, month_name, year_name, holiday_flag, weekend_flag, quarter, etc ... all the typical fields you'd see in a time dimension.

    Option 2: call it the date dimension. Fields include date_stamp and all the other fields above. This option *excludes* the time_stamp field (date_stamp is a date, but it's trunc'd to the day). Data is at the daily level, not hourly.

    In either case, data in the fact table has to be down to the second (to the transaction level as you stated). So, option 1 is easy to understand (one field in the fact table which is a FK to the time dimension). Option 2 requires 2 fields in the fact table - one that is truncated to the day (that's the FK field to date dimension) and another that has it down to the second (this field is *not* a FK to any dimension, but is the original date/time of the transaction).

    Therefore, if we go with option 1, the time dimension will potentially (and very likely) have 1 row per every second of the day. If we go with option 2, the date dimension (not time) will have 1 row per day.

    Hope that helps - sorry for confusing things with the hourly. Thanks for the thoughts.

  10. #10
    Join Date
    Mar 2007
    Location
    Ft. Lauderdale, FL
    Posts
    3,555
    Got it.

    In our shop we have solved it by having a DIM_DATE dimension table as described in your option #2; table is fully populated, one row per day since 01/01/1932 -the year the company was founded- all the way thru 12/31/2049

    We see no reason to have a DIM_TIME table as described in your option #1; in our analysis this table would add nothing to any of our datamarts.
    Last edited by PAVB; 12-14-2007 at 01:56 PM.
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  


Click Here to Expand Forum to Full Width