
Short message latency and NUMA effects

July 23, 2013

I've previously written a bunch about the effects of location, Location, LOCATION! on MPI applications.

Here's another subtle NUMA effect that a well-tuned MPI implementation can hide from you: intelligently distributing traffic between multiple network interfaces.

Yeah, yeah, most MPI implementations have had so-called "multi-rail" support for a long time (i.e., using multiple network interfaces for MPI traffic).  But there's more to it than that.

For the purposes of this blog entry, assume that you have one network uplink per NUMA locality in your compute servers.

Cisco's upcoming ultra-low latency MPI transport in Open MPI, for example, examines each compute server at MPI job startup.  It makes many setup decisions based on what it finds; let's describe two of these decisions in detail (sidenote: I'm using Cisco's "usNIC" as an example because I'm not aware of other transports making these same kinds of setup decisions)...
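
To make the "NUMA-near" determination concrete, here's a minimal sketch of the kind of discovery involved.  This is not Cisco's actual usNIC code; it just shows how a process on Linux can check whether a given interface hangs off its own NUMA node via sysfs and libnuma (the interface name eth4 is a made-up example):

```c
/* Hypothetical sketch -- not Cisco's usNIC code.  On Linux, the kernel
 * exposes the NUMA node of a NIC's PCI device in sysfs; libnuma tells
 * us which node the calling process is currently running on.
 * Build: gcc numa_near.c -lnuma */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <numa.h>

/* NUMA node of the NIC's PCI device, or -1 if the kernel doesn't say */
static int nic_numa_node(const char *ifname)
{
    char path[256];
    int node = -1;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/class/net/%s/device/numa_node", ifname);
    if ((f = fopen(path, "r")) != NULL) {
        if (fscanf(f, "%d", &node) != 1)
            node = -1;
        fclose(f);
    }
    return node;
}

int main(void)
{
    int my_node  = numa_node_of_cpu(sched_getcpu()); /* my NUMA node */
    int nic_node = nic_numa_node("eth4");            /* example name */

    printf("process on node %d, NIC on node %d: %s\n",
           my_node, nic_node,
           (nic_node < 0 || nic_node == my_node) ? "NUMA-near"
                                                 : "NUMA-remote");
    return 0;
}
```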

1. Limit short message NUMA distance

Traditional multi-rail support in MPI implementations round-robins message fragments across all available network interfaces.  This allows MPI to split very large messages across multiple network links, effectively multiplying the available bandwidth.
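
For illustration, here's a hypothetical sketch of that classic striping scheme.  The nic_t type, the send_fragment() stub, and the fragment size are stand-ins, not any real MPI implementation's internals:

```c
/* Hypothetical sketch of classic multi-rail striping: a large message is
 * cut into fragments and dealt round-robin across every interface. */
#include <stdio.h>
#include <stddef.h>

#define FRAG_SIZE 65536           /* example fragment size */

typedef struct { int id; } nic_t; /* one entry per network interface */

/* Stand-in for the real "put this fragment on the wire" call */
static void send_fragment(nic_t *nic, size_t offset, size_t len)
{
    printf("rail %d: fragment at offset %zu, %zu bytes\n",
           nic->id, offset, len);
}

/* Deal fragments round-robin so aggregate bandwidth approaches
 * the sum of all links. */
static void stripe_large_message(nic_t *nics, int nnics, size_t msg_len)
{
    size_t off = 0;
    int rail = 0;

    while (off < msg_len) {
        size_t frag = (msg_len - off < FRAG_SIZE) ? (msg_len - off)
                                                  : FRAG_SIZE;
        send_fragment(&nics[rail], off, frag);
        rail = (rail + 1) % nnics;  /* next interface for next fragment */
        off += frag;
    }
}

int main(void)
{
    nic_t nics[2] = { {0}, {1} };
    stripe_large_message(nics, 2, 200000);  /* 4 fragments over 2 rails */
    return 0;
}
```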

The fact that some of the network interfaces may be NUMA-remote is irrelevant for large messages.  By definition, the latency of large messages is already high, so the additional latency required to traverse inter-processor links (such as Intel's QPI) is negligible compared to the overall transit time of the message.

But for short messages, the usual argument for round-robin schemes is not about bandwidth; it's about increasing message rates.  Consider, however, what happens if the round-robin set includes NUMA-remote interfaces.  In this case, the inter-processor links needed to reach those remote interfaces can either artificially limit the expected message rate improvements, or even cause outright performance loss (e.g., due to NUMA/NUNA congestion).

For these reasons, by default, Cisco's usNIC Open MPI transport will only use NUMA-near network interfaces for short messages.  Large messages will, of course, be striped across all available network interfaces to get the expected bandwidth multiplication.
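
Putting the two policies together, the selection logic might look roughly like the sketch below.  The types, function name, and 50K threshold are illustrative assumptions, not Cisco's actual code:

```c
/* Hypothetical sketch of the selection policy: short messages stay on
 * NUMA-near interfaces; everything else may be striped across all rails. */
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int  id;
    bool numa_near;   /* true if the NIC hangs off this process's node */
} nic_t;

/* Select the rails a message of 'len' bytes may use; returns the count. */
static int eligible_rails(const nic_t *nics, int nnics, size_t len,
                          size_t short_thresh, const nic_t **out)
{
    int n = 0;
    for (int i = 0; i < nnics; i++) {
        /* Short messages skip NUMA-remote NICs so the inter-processor
         * hop never sits on the latency-critical path. */
        if (len <= short_thresh && !nics[i].numa_near)
            continue;
        out[n++] = &nics[i];
    }
    return n;
}

int main(void)
{
    nic_t nics[2] = { { 0, true }, { 1, false } };
    const nic_t *sel[2];

    printf("1 KB message uses %d rail(s)\n",
           eligible_rails(nics, 2, 1024, 50 * 1024, sel));
    printf("1 MB message uses %d rail(s)\n",
           eligible_rails(nics, 2, 1 << 20, 50 * 1024, sel));
    return 0;
}
```

With a 1 KB message, only the NUMA-near rail is selected; with a 1 MB message, both rails are, so striping proceeds as usual.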

2. Dynamic short message threshold

Cisco's usNIC transport is a bit different from other Open MPI transports.  Rather than having an upper-layer engine do the fragmenting, it handles fragmenting and network ACKing internally.  We can therefore draw a fine line between what we want the upper-layer engine to do and what we handle down in the transport layer, such as dynamically determining the length of a "short" message:

  • When there is one usNIC interface available, short = 150K
  • When more than one usNIC interface is available, short = 50K
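
In code, that decision amounts to very little; here's a hypothetical helper mirroring the values above (assuming "K" means KiB; the function name is made up):

```c
/* Hypothetical helper mirroring the thresholds listed above; the name
 * and structure are illustrative, not the actual usNIC transport code. */
#include <stdio.h>
#include <stddef.h>

static size_t short_message_threshold(int num_usnic_interfaces)
{
    /* One interface: stretch "short" to 150K so the upper-layer engine
     * fragments less.  Multiple interfaces: shrink it to 50K so large
     * messages start striping across rails sooner. */
    return (num_usnic_interfaces == 1) ? 150 * 1024 : 50 * 1024;
}

int main(void)
{
    printf("1 NIC:  short <= %zu bytes\n", short_message_threshold(1));
    printf("2 NICs: short <= %zu bytes\n", short_message_threshold(2));
    return 0;
}
```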

Skipping the complicated details, this dynamic determination of the length of a "short" message lets us optimize the transport layer's interaction with the upper-layer MPI engine in the single-interface case.  When there's only one usNIC interface, a longer "short" definition means that the upper-layer engine fragments messages less, resulting in fewer engine-to-transport-layer traversals.

In summary, these are two complicated implementation details that a well-tuned MPI implementation hides from you.

MPI is good for you!


Tags: UCS HPC mpi Open MPI USNIC VIC
