800G or bust? Why more bandwidth isn’t the answer
The true measure of better network performance is not how many gigabits you have, but how you use them.
The highly anticipated innovations in 800G Ethernet will provide a bigger, faster pipe to handle increasingly data-hungry AI/ML, edge and IoT applications. What it won’t do is solve the performance issues and bottlenecks that already exist in 100G and 400G networks, which will only become magnified at higher speeds. Until data centers address the problem of 30-year-old centralized networking architectures designed for simpler traffic flows, the most likely outcome is more congestion and even more stranded CPU, GPU and memory resources.
In the past, centralized network architectures effectively managed north-south traffic to process basic pull requests from applications. In these spine-leaf/fat-tree architectures, throughput was everything – more bandwidth equaled better performance. That equivalency no longer holds for networks managing the complex traffic patterns in today’s advanced data centers. Architectures built for linear, point-to-point traffic are at odds with multi-directional flows that are much more sensitive to congestion and variable tail latency, which in turn can delay overall application processing.
Tuning the network to meet variable speed and latency requirements – and manage multiple concurrent workloads with different characteristics – is necessary to meet the dynamic demands of next-generation applications. A distributed network fabric is inherently designed for complex, variable workflows with the intelligence to maximize bandwidth utilization at any speed. In deploying future-proof infrastructure capable of adapting to changing demands, there are six key factors to consider in your network design:
- Bandwidth efficiency
- Path diversity
- Oversubscription
- Short time frame reaction
- Parallel/MPI processing
- Disaggregation
Bandwidth efficiency
Adding bandwidth enables more data to travel over the network at a faster rate on its way to the switches. Unfortunately, in centralized switching architectures, increasing the speed makes it more likely that higher volumes of data will arrive at the same switch port at the same time. The result? Unintended congestion and workload latency while the switch buffers and processes each frame through available ports. It’s akin to widening a highway to allow more traffic to flow while keeping the on-ramps and off-ramps the same, creating massive backups. Spikes in tail latency get worse as more clusters are added and parallel processing requirements increase.
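To put rough numbers on the highway analogy, consider a simple incast scenario (the fan-in ratio and burst length below are illustrative assumptions, not measurements): several senders burst toward one egress port at line rate. The queueing delay they create is set by the fan-in ratio, so raising the line rate alone does not shrink it.

```python
# Back-of-the-envelope incast arithmetic (hypothetical fan-in and burst length):
# when several senders burst toward the same egress port, the added queueing delay
# is set by the fan-in ratio, not by the line rate -- faster links also fill the
# buffer faster.

def added_queueing_ms(line_rate_gbps: float, fan_in: int, burst_ms: float) -> float:
    arrival = fan_in * line_rate_gbps                   # aggregate arrival rate into the port (Gb/s)
    drain = line_rate_gbps                              # a single egress port drains at line rate
    backlog_gbit = (arrival - drain) * burst_ms / 1000  # data queued up during the burst
    return backlog_gbit / drain * 1000                  # extra milliseconds needed to drain it

for rate in (100, 400, 800):
    print(f"{rate}G links, 4:1 incast, 1 ms burst -> "
          f"{added_queueing_ms(rate, fan_in=4, burst_ms=1):.1f} ms of added queueing")
```

Running the sketch gives the same added queueing delay at 100G, 400G and 800G: the wider highway fills its buffers just as quickly as the narrow one did.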
Achieving bandwidth efficiency requires making the most of all available links to route the traffic, which is difficult to do in centralized spine-leaf architectures, where an estimated 90% of the traffic is routed through 20% of the links.*
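One way this kind of skew emerges is static, hash-based flow placement, where each flow is pinned to a single uplink for its entire lifetime and a handful of heavy “elephant” flows can land on the same links. A minimal sketch with made-up flow sizes illustrates the effect (it is an illustration, not the source of the statistic above):

```python
import random
from collections import defaultdict

# Minimal sketch with made-up flow sizes: static, hash-style placement pins each flow
# to one uplink regardless of how loaded that link is. Because a few "elephant" flows
# carry most of the bytes, link loads end up badly skewed while other links sit idle.

random.seed(42)
uplinks = 8
flow_sizes = [random.choice([1, 1, 1, 1, 100]) for _ in range(80)]  # mostly mice, some elephants (Gb)

load = defaultdict(int)
for flow_id, size in enumerate(flow_sizes):
    link = (flow_id * 2654435761) % uplinks   # stand-in for a per-flow 5-tuple hash
    load[link] += size                        # the flow stays on this link for its lifetime

total = sum(load.values())
for link in sorted(load, key=load.get, reverse=True):
    print(f"uplink {link}: {load[link]:4d} Gb ({100 * load[link] / total:.0f}% of traffic)")
```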
Distributed network fabrics avoid data collisions by using adaptive multi-path routing to manage high-bandwidth flows, allowing a single flow to be distributed over multiple distinct network paths between the source and destination. Routing algorithms adaptively distribute packets across multiple paths on a per-packet and per-flow basis. Each path’s real-time end-to-end congestion is used to continually select the optimal path from the available links. The ability to support a mix of high-bandwidth and regular-bandwidth nodes in a single fabric provides even greater flexibility to add the precise amount of bandwidth required per node.
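A minimal sketch of the idea, with hypothetical paths and congestion figures: each packet independently takes whichever path currently reports the least end-to-end congestion, so a single flow is spread across many links instead of being pinned to one.

```python
# Minimal sketch of adaptive, per-packet multi-path selection (hypothetical paths and
# congestion figures): every packet takes the currently least-congested path between
# source and destination, so one flow is distributed over many links.

paths = {  # path id -> latest end-to-end congestion estimate
    "A": 0.10,
    "B": 0.40,
    "C": 0.15,
    "D": 0.05,
}

def pick_path() -> str:
    """Select the path with the lowest current congestion estimate."""
    return min(paths, key=paths.get)

def send(packet_id: int) -> None:
    path = pick_path()
    print(f"packet {packet_id} -> path {path}")
    # Model the packet adding load to the chosen path; a real fabric updates this
    # from end-to-end congestion feedback rather than a local counter.
    paths[path] += 0.05
    # Congestion on every path decays as earlier packets drain.
    for p in paths:
        paths[p] = max(0.0, paths[p] - 0.02)

for pkt in range(8):
    send(pkt)
```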
Path diversity
In centralized architectures, bandwidth utilization is always limited by the slowest link. Every source/destination pair is bound to a single path through the network: traffic cannot be spread across alternative routes, the path’s performance is capped by its slowest link, and that path must be shared by all applications on the node. As cluster scale increases, network layers and job counts also increase, making contention and path congestion almost a certainty.
Conversely, a distributed network fabric uses a high radix of optimal paths that share no common physical links. At scale, the traffic load is balanced with high path diversity, reducing contention and congestion hotspots to enable high resiliency, high throughput, high reliability and low latency.
Packets are broken down into FLITs (flow control units) that are interleaved into virtual channels and reconstructed at the destination. This approach prevents large messages and bulk data transfers from blocking high-priority traffic or causing latency spikes. Critical small messages that are highly sensitive to latency are assigned ultra-high priority and are always serviced first, even under heavy congestion, so they are effectively immune to it.
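The scheduling principle can be sketched in a few lines (queue names and message sizes are illustrative): the port is arbitrated flit by flit, and any flit waiting in the ultra-high-priority channel is sent before the next flit of an in-progress bulk transfer.

```python
from collections import deque

# Rough sketch of flit-level virtual-channel arbitration (illustrative names and sizes):
# messages are split into flits, and the ultra-high-priority channel is always drained
# first, so a small latency-critical message is never stuck behind a bulk transfer.

FLIT_BYTES = 64

def to_flits(name: str, size_bytes: int) -> deque:
    return deque(f"{name}[{i}]" for i in range(-(-size_bytes // FLIT_BYTES)))

bulk_vc = to_flits("bulk_transfer", 1024)   # large transfer already in flight
urgent_vc = deque()                         # ultra-high-priority virtual channel

wire = []                                   # order in which flits leave the port
for cycle in range(24):
    if cycle == 3:                          # a tiny critical message arrives mid-transfer
        urgent_vc.extend(to_flits("ctrl_msg", 128))
    if urgent_vc:
        wire.append(urgent_vc.popleft())    # urgent flits preempt the bulk flow
    elif bulk_vc:
        wire.append(bulk_vc.popleft())

print(" ".join(wire))
```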
Oversubscription
Oversubscription has been the traditional model for increasing scale in centralized switching networks. The approach was considered reasonable when the network only had to accommodate single applications passing large packets in a north-south pattern. Compute power was the primary bottleneck, so it didn’t really matter what the network itself was doing.
By contrast, today’s networks must support multiple applications generating north-south and east-west traffic, along with parallel processing workloads. Compute power is no longer the issue – now the network has become the bottleneck. The centralized switching architecture itself creates route collisions and congestion, resulting in latency. Oversubscription compounds the problem for both scale and budget because it creates more buffering, reducing the effective bandwidth available to applications.
A multi-path network fabric eliminates oversubscription by providing multiple physical links from every source to every destination, preventing congestion by selecting the optimal path and adjusting based on network conditions.
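The ratio itself is simple arithmetic. A quick sketch with made-up port counts shows how fast effective per-server bandwidth shrinks in an oversubscribed leaf:

```python
# Quick oversubscription arithmetic with made-up port counts: a leaf switch with more
# downlink than uplink capacity cannot carry all of its servers' traffic at once, and
# the shortfall shows up as buffering and reduced effective bandwidth.

def oversubscription(downlinks: int, downlink_gbps: int, uplinks: int, uplink_gbps: int):
    down = downlinks * downlink_gbps     # total capacity facing the servers
    up = uplinks * uplink_gbps           # total capacity facing the spine
    ratio = down / up
    worst_case = downlink_gbps / ratio   # per-server bandwidth if every server sends north at once
    return ratio, worst_case

ratio, per_server = oversubscription(downlinks=48, downlink_gbps=100, uplinks=8, uplink_gbps=400)
print(f"{ratio:.1f}:1 oversubscribed; worst-case {per_server:.0f}G per 100G-attached server")
```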
*Based on empirical results from InfiniBand deployments.