When thinking about MPI, most people immediately think of short-message latency, or perhaps large-message bandwidth.
But have you ever thought about what your MPI implementation has to do before your application even calls MPI_INIT?
Hint: it's pretty crazy complex, from an engineering perspective.
Think of it this way: operating systems natively provide a runtime system for individual processes. You can launch, monitor, and terminate a process with the OS's native tools. But now think about extending all of those operating system services to gang-support N processes in exactly the same way that a single process is managed. And don't forget that those N processes will be spread across M servers / operating system instances.
In short, that's the job of a parallel runtime system: coordinate the actions of, and services provided to, N individual processes spread across M operating system instances.
It's hugely complex.
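To make that contrast concrete, here is a deliberately naive, local-only sketch of what "gang launching" even looks like: fork N copies of a command and wait for them all, using nothing but plain POSIX calls. It is purely illustrative (no real MPI launcher works this way); a real runtime must do the equivalent across M servers, and also wire the processes up to each other, forward their I/O, and handle failures.

```c
/*
 * Illustrative only: a naive, single-server "gang launcher" that forks N
 * copies of a command and waits for all of them.  A real MPI runtime has
 * to do the equivalent across M servers, plus wire-up, I/O forwarding,
 * signal propagation, and fault handling.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <nprocs> <command> [args...]\n", argv[0]);
        return 1;
    }

    int n = atoi(argv[1]);
    char **cmd = &argv[2];   /* argv is NULL-terminated, so cmd is too */

    /* Launch: fork N children, each exec'ing the application */
    for (int i = 0; i < n; ++i) {
        pid_t pid = fork();
        if (0 == pid) {
            execvp(cmd[0], cmd);
            perror("execvp");
            _exit(127);
        } else if (pid < 0) {
            perror("fork");
            return 1;
        }
    }

    /* Monitor / terminate: here, just wait for every child to exit */
    for (int i = 0; i < n; ++i) {
        wait(NULL);
    }
    return 0;
}
```

Even this toy version hints at the problem: everything above is trivially available from the OS for one local process, and it becomes a distributed-systems problem the moment the processes span multiple servers.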
Parallel runtime environments have been a topic of much research over the past 20 years. There have been tremendous advancements made, largely driven by the needs of the MPI and greater HPC communities.
When I think of MPI runtime environments, I typically think of a spectrum: at one end, a full-featured native runtime system provides all of the services an MPI job needs; at the other end, the MPI implementation has to provide everything itself.
Put differently, there are many services that an MPI job requires at runtime. Some entity has to provide these services - either a native runtime system, or the MPI implementation itself (or a mixture of both).
Here are a few examples of such services: launching the processes on each server, monitoring them, forwarding their standard input/output/error, propagating signals, distributing the information that the processes need to find each other, and cleanly terminating the whole job (whether it ends normally or not).
That is a lot of work to do.
Oh, and by the way, these tasks need to be done scalably and efficiently (this is where the bulk of the last few decades of research has been spent). There are many practical engineering issues that are just really hard to solve at extreme scale.
For example, it'd be easy to have a central controller and have each MPI process report in (this was a common model for MPI implementations in the 1990s). But you can easily visualize how that doesn't scale beyond a few hundred MPI processes - you'll start to run out of network resources, you'll cause lots of network congestion (including contending with the application's own MPI traffic), etc.
So use tree-based network communications, and distribute the service decisions among multiple places in the computational fabric. Easy, right?
Errr... no.
Parallel runtime researchers are still investigating the practical complexities of just how to do these kinds of things. What service decisions can be distributed? How do they efficiently coordinate without sucking up huge amounts of network bandwidth?
And so on.
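To make the tree idea from above concrete, here is a minimal sketch (not how any particular MPI implementation actually does it) of computing a process's parent and children in a k-ary tree rooted at rank 0. If control messages flow only along these edges, each process talks to at most K+1 peers, and a 20,000-process job is only a handful of levels deep, instead of 20,000 processes all hammering one central controller.

```c
/*
 * Hypothetical sketch: parent and children of a rank in a k-ary tree
 * rooted at rank 0.  Routing runtime control traffic along such a tree
 * bounds the number of peers any one process must talk to.
 */
#include <stdio.h>

#define K 4   /* tree radix: a tunable, chosen arbitrarily here */

static int tree_parent(int rank)
{
    return (0 == rank) ? -1 : (rank - 1) / K;
}

static int tree_children(int rank, int nprocs, int children[K])
{
    int nchildren = 0;
    for (int i = 1; i <= K; ++i) {
        int child = rank * K + i;
        if (child < nprocs) {
            children[nchildren++] = child;
        }
    }
    return nchildren;
}

int main(void)
{
    int nprocs = 20;
    for (int rank = 0; rank < nprocs; ++rank) {
        int children[K];
        int nc = tree_children(rank, nprocs, children);
        printf("rank %2d: parent %2d, %d children\n",
               rank, tree_parent(rank), nc);
    }
    return 0;
}
```

Of course, this only solves the easy part (who talks to whom); deciding which service decisions can actually be made at interior nodes of the tree, and how those nodes stay consistent with each other, is where the hard research lives.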
Fun fact: a sizable amount of the research into how to get to exascale involves figuring out how to scale the runtime system.
Just look at what is needed today: users are regularly running MPI jobs with (tens of) thousands of MPI processes. Who wants an MPI runtime that takes 30 minutes to launch a 20,000-process job? A user will (rightfully) view that as 29 minutes of wasted CPU time on 20,000 cores.
Indeed, each of the items in the list above is worthy of its own dissertation; they're all individually complex.
So just think about that the next time you run your MPI application: there's a whole behind-the-scenes support infrastructure in place just to get your application to the point where it can invoke MPI_INIT.
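For reference, here is the canonical minimal MPI program. All of the runtime machinery described above exists just so that the N copies of this program get launched, find each other, and reach that MPI_Init call in the first place.

```c
/* The classic minimal MPI program: everything discussed in this post
 * happens before (or inside) that single MPI_Init call. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```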