A table has a PK. Every PK compbination has average 75 rows for every month. Total number of rows around 50 million. So, it is calling for partitioning.
Making the partitions based on the date in the row seems most logical. The data in the table gets processed for a "month".
My question is, if I partition based on range in PK column and distribute the partiotions across datafiles (and hence disks) will I be distributing I/O ? So, data for a month is distributed across disks and the processing can be faster ?
Is this a correct view to partitioning ? Will it have considerable performance advantage over partitioning based on date range ? Obviously, the date based partitioning will be easier to maintain.
I think you have the concept down, and it sounds like your method of partitioning will speed your processing. Your performance depends on how your data will be distributed in the different partitions/datafiles and how it will be accessed.
For example, if you are processing data for an entire month and the data is partitioned on date by month, you will just be processing data from one partition at a time. However, if you spread that data over multiple partitions/datafiles, your performance will increase.
If the data in the partitions is evenly distributed, then range partition is a good choice. For example, every month the sales table gets around 100,000 rows; here the date or month will be the partition key.
On the other side, if the sales table gets 10,000,000 rows for the months Oct, Nov, and Dec and gets 50,000 rows for the remaining 9 months, then range partition is not a good choice. You have to choose HASH partition.
In both cases, the partition key will be sales date.
Handling 50M rows is a big task. Do the analysis caefully and decide which one is needed.