SQL query to weed out old duplicate entries
I have a database of job numbers and associated dates.
I want write an SQL query to select the duplicate job numbers and of those duplicates, select all but the newest of the jobs. The end result being a list of the old duplicate jobs, which can then safely be removed from the database.
select job, job_date from jobs j
where job_date < (
select MAX(job_date) from jobs m
where m.job = j.job);
"The person who says it cannot be done should not interrupt the person doing it." --Chinese Proverb
Ok, that has helped get the major roadblock out of the way: How to do a "for each item in a, do blah".
Thank you for that much; but I'm still not where I need to be. I still need to figure out how return all the jobs who's date's aren't max, since it seems that the MAX(dateField) returns the highest date VALUE (based on pur numerical comparison) not the latest date (ie, 9/9/02 is returned because it contains the the 'biggest numbers' rather than the latest date of 7/11/03)...
The other annoying thing, is the whole thing chokes up on the data when I tried the statements as suggested above. It just slows to a crawl for over 15 mins trying to process it. When it finally finished, it still seemed to return some weird values (and the dates are not accurate due to the 'MAX' problem described above. The database has ~11,000 entries in it. I'm not sure if that's too much for it to handle, but I would have thought it would be fine. The whole time I've taken to write this reply, it's not even 5% done with a simple 'A-Z sort' on the results that it returned earlier. Argh.
I'll have to work on this more Monday. It's almost quitting time. (So I appologise in advance for not responding further until then.)
It sounds like the max problem is one of your own making, if you are storing dates as strings or numbers. you will have to convert them to dates, using an appropriate format picture, and then apply the max function to it.
If you are on a high enough version you can use analytic functions to find the old jobs.
A composite index on job and datefield would be nice, but that date format is really going to hurt you.
partition by (job) max_datefield
datefield < max_datefield
Ok, I just looked at it, and for some reason the date/time field had been changed. I was able to change the field to the proper format, so that works now.
Now that the date/time is fixed, I ran a test on a small subset of the data, and the first bit of code DOES work...It just takes a LOT of time to run when there are all the records in the database. I'll have to just let it churn...
Thanks for ALL your help!
It's very much appreciated.
Click Here to Expand Forum to Full Width