After switching to a parallel load, this started working.

Here's my sqlldr now:

sqlldr / control=${CNAME} errors=0 bad=${BADNAME} log=${LOG}_load data=${DATAFILE} direct=true parallel=true multithreading=true skip_index_maintenance=true

Total stream buffers loaded by SQL*Loader main thread: 206
Total stream buffers loaded by SQL*Loader load thread: 616

Does this only work when parallel / multithreading is enabled?

In addition, in preliminary testing with these options, some small data sets that were taking ~2 mins to load now take 20 secs. I'm excited about the speed, but I'm nervous about how I/O- and CPU-intensive this could be in full-scale testing.
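
To get a rough number before the full-scale run, one thing I could try is wrapping the load in bash's `time` keyword. This is only a sketch, assuming bash: `timed_run` is a hypothetical helper name of my own, not anything from SQL*Loader, and it only reports wall-clock and CPU time for the client-side sqlldr process, not the I/O or CPU spent on the database server.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of SQL*Loader): run any command and print
# wall-clock, user-CPU and system-CPU seconds to stderr when it finishes.
# Assumes bash, since it relies on the `time` keyword and TIMEFORMAT.
timed_run() {
    local TIMEFORMAT='real=%R user=%U sys=%S'
    time "$@"   # the function returns the wrapped command's exit status
}

# Example call, reusing the same options as the command above:
# timed_run sqlldr / control=${CNAME} errors=0 bad=${BADNAME} log=${LOG}_load \
#     data=${DATAFILE} direct=true parallel=true multithreading=true \
#     skip_index_maintenance=true
```

Comparing the `user` and `sys` figures with and without the parallel / multithreading options would at least show how much extra CPU the client burns; server-side load would still need to be watched separately.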

Anybody have any insight?