I'm looking for the pros and cons of any approaches people have used to handle virtual deletes.
Is it better to save off the records using triggers into a generic table using XML?
How does one handle child records?
Should all records simply be marked with a Deleted_Flg?
Are such records ignored using fine-grained transaction control?
Are such fields added to all indexes?
Do such fields get histograms?
Are such records periodically archived?
Are deleted records instead copied to exact copies of each table?
How are un-deletes handled?
virtual delete is something like this..when a employee leaves the organisation the employee record is deleted by the Human resources team from the front end application by clicking on the delete button (virtual delete) but in the backend a trigger executes and that record is moved into a history table..or a delete column is filled with Y
I have used things like using a trigger to move that record into a history table.for example when emp record is deleted a employee ecord is moved to employee_history table and the child table containing dependents is also moved to dependents_history table.
You can also use a column or a flag as you would like to call it. marked it as deleted.
If you have a global HR application that is implemented 1st method is preferrable..when you have millions of records..(and a high attrition rate ;-)
I am curious to know how did you think of a generic table and a xml format..i would not prefer XML unless you want to do some EDI or B2B stuff..my experience with XML is it is slow because of the parsing overhead..
Yes in the above scenarios
I would add indexes
I would generate histograms on indexed columns
Undeletes in this case should be treated as a separate insert.Additonaly there would be a timestamp column in the employee history table which would capture the time the record was deleted ,who deleted etc.(when employee left the organisation)if he rejoins back at a later time then it would be a brand new insert.
I have no specific requirements in mind, actually. I've simply decided to put the problem under a microscope and see what falls out.
The benefits of using XML would be the ability to create a generic solution. Given an ERwin template and code-generating macro code, one could set up triggers to move any and all deleted info into a generic 'holding' table using XML. Since most deletes are done one at a time, the performance should not be an issue. However, I have not yet tried this idea - what might I be missing?
With the flag approach, as I see it:
- Need a flag in each table - no big deal
- Need to populate the flag - probably with triggers - no big deal
- Need to check the flag with every statement - problem
- Can add WHERE to every single statement - maintenance issue
- Can wrap all tables in views that have WHERE - I don't like views
- Can use fine-grained access - but can't see correct plans
- I find this a MAJOR issue - what does everyone else think?
- Even once we solve the WHERE clause, we still have performance issues
- Histogram the column
- Will only use index for deleted records, since this is usually the minority
- Add column to most indexes
- Still not good selectivity by itself but now have a lot of changed indexes
- Is this sufficient?
- Still have lots more records than we need - is this an issue?
With the duplicate table approach...
- We have to duplicate a lot of tables - maintenance issue
- Now we are moving data - I guess this is the same with the XML approach
- If we are deleting a top record, we need to move the entire tree. How is this best accomplished
- If we are moving it to new tables, do these tables have RI?
- If so, we need to populate from top down, but deletes have to happen from bottom up
- If no RI, then how can we guarantee data integrity for un-deletes?
Anyway, that's a lot of what I've been thinking about. Feel free to respond to those thoughts directly or just add your own.
I prefer to use start date/end dates in the master or a start date/end date in a detail table.
The problem with moving the data to a "history" table is your "report on all records active or not" reporting becomes a UNION. Also, you have the potential of a duplicated unique key in the history table if the record is deleted, re-entered, and re-deleted.
A flag, on the other hand, doesn't give you much information. Sure, you can see what is active right now, but what about on 5/10/2002? Maybe the record was deleted, re-added, and re-deleted.
All in all, it depends on the specific situation. I try to avoid having an "active" bucket and a "history" bucket, but in some cases they make sense.
I can see the advantage of having a date field. But I'd say that it would be best served in addition to the flag field, don't you think? The problem with only having a date field is that the 'active' records would either have NULL or a fake, way-future date in that field. Now, I hate fake data, so that leaves NULLs, and NULLs are not indexed. So, if we ended up with a large percentage of records deleted, we would never be able to make an index to help ourselves. Now, with a flag field, we can add that field to any index we like if it will help. So, while a 'date of deletion' field would definitely provide useful information, I'm thinking that we would still need a flag - were we to go that route.
you could index your nulls with a function-based index, but if the ratio of deletes to actives get's high then it would be time to throw them in a history table i guess. Kind of a hybrid solution.
As soon as the possibility of "undeleting" a record get's raised as a business requirement, that would be time for me to say "well, you don't want to delete them at all then, do you?" The requirement kind of shifts to providing status flags for records, which is something more intuitively understood i think.
To my mind, "delete" as a business requirement means "gone forever". You want it back, you create it again, you don't "undelete"
WEEELL there's deleting and deleting . . . e.g. we keep a client's address history (kind of audit trail), old addresses are never un-virtually-deleted but we still need them.
I use the end-date method for this and other m2m tables. In the cases I have looked at over the years, the lack of an efficient index on end-date did not, in practice, slow things down. Either I retrieve 5 rows (using another index) and only display 1 of them - OK, I'm wasting milliseconds. Or I'm in a FTS regime where jugling with indexes and selecting on NULL or NOT NULL did not make any difference - sometimes FTS is the best way. There is a middle ground where this WILL make a difference, but I don't think I've been there.
"The power of instruction is seldom of much efficacy except in those happy dispositions where it is almost superfluous" - Gibbon, quoted by R.P.Feynman