My apologies; I must've gotten confused with another post. My guess is that the issue you're hitting is related to the default shared-memory limit Docker applies - typically 64MB - which is insufficient for high numbers of MPI ranks. Try adding the flag:
--shm-size=512M
... to your 'docker run' command. Basically, every MPI process stores some data in /dev/shm (a shared-memory filesystem), but Docker defaults to a small allocation there. That's typically fine for a 4- or 8-core laptop, but not for a 48-core system like yours. For the GNU/MPICH combination in use, 512MB is likely enough for 48 ranks; worst case, try 1G as well.
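For example, a minimal sketch of the full command (the image name and application here are just placeholders for whatever you're actually running):

docker run --rm -it --shm-size=512M my-mpi-image mpirun -np 48 ./my_app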
And yes, if you have 48 cores, using only 6 or 12 won't give you the full performance. It's not always perfectly linear, since memory bandwidth matters a lot as well, but I'd say try the above and then set the number of tasks correctly. For the container: if you're using the Jupyter version, MAX_TASKS_PER_NODE should be set automatically, but if you're not using the Jupyter version then yes, you need to set it explicitly. Setting both of those variables, plus NTASKS=48, should work.
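If it helps, here's roughly what that could look like passed on the 'docker run' line (assuming the container picks these up as plain environment variables; the image name is again a placeholder):

docker run --rm -it --shm-size=512M -e MAX_TASKS_PER_NODE=48 -e NTASKS=48 my-mpi-image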
If not, let me know and we'll try to solve it quickly.
Cheers,
- Brian