In computer networking, the term link aggregation applies to various methods of combining (aggregating) multiple network connections in parallel in order to increase throughput beyond what a single connection could sustain and to provide redundancy in case one of the links should fail. A Link Aggregation Group (LAG) combines a number of physical ports together to make a single high-bandwidth data path, so as to implement the traffic load sharing among the member ports in the group and to enhance the connection reliability. – Wikipedia
More formally, the link aggregation we are concerned with here is defined by IEEE 802.3ad-2000 (since incorporated into IEEE 802.1AX). The standard defines a method for grouping multiple physical Ethernet links into a single logical interface, also known as a link aggregation group (LAG). IEEE 802.3ad link aggregation is an important technology for network operators because it increases network capacity without hardware upgrades and provides port redundancy without user intervention when a port fails. And because IEEE 802.3ad is an established industry standard, it enables interoperability of multi-vendor devices in link aggregation scenarios.
Ok, that sounds like a great idea. So, what could possibly go wrong?
In this series of posts, we take a look at problems associated with link aggregation. More specifically, we explore how several types of link aggregation misconfigurations can be identified and corrected in an automated and effective way. In many situations, it may not be immediately obvious that a link aggregation misconfiguration even exists. However, unusual traffic patterns and/or sub-par performance of an aggregated link generally provide clues that something is amiss.
There are several scenarios in which misconfiguration of aggregated links can occur. In the more difficult-to-diagnose case, devices on opposite ends of a link should be connected but are not. This can happen, for example, when ports are added to a link aggregation group (LAG) on one side of a connection but, for whatever reason, the corresponding ports are not properly added, or not added at all, to the appropriate group on the other side. We will explore this case in Part 2 of this two-part series. In the less difficult-to-diagnose scenario, an aggregated link has been successfully created but exhibits sub-par performance in terms of expected peak traffic-carrying capacity. This can happen, for example, when the ports that are actually members of the LAG are not the ones assumed to be members of the group. In this Part 1 post, we explore that scenario.
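Both scenarios reduce to a mechanical check once LAG membership has been collected from each device (for example via SNMP). The sketch below compares the member sets reported by the two ends of a connection; the device names, LAG names, port names and the `lag_members` structure are all hypothetical illustrations, not NetSpyGlass APIs:

```python
# Hypothetical LAG membership as it might be collected from each device.
# All names below are illustrative only.
lag_members = {
    "core-1": {"ae0": {"xe-0/0/0", "xe-0/0/1", "xe-0/0/2"}},
    "core-2": {"ae0": {"xe-0/0/0", "xe-0/0/1"}},  # one port missing on this side
}

def compare_lag_sides(a_dev, a_lag, b_dev, b_lag, members):
    """Report ports present in one side's LAG but absent from the other's.

    Simplifying assumption: member ports are cabled pairwise with matching
    names. A real tool would map ports via discovered topology (e.g. LLDP
    neighbors) rather than by name.
    """
    a = members[a_dev][a_lag]
    b = members[b_dev][b_lag]
    return {"only_on_" + a_dev: a - b, "only_on_" + b_dev: b - a}

diff = compare_lag_sides("core-1", "ae0", "core-2", "ae0", lag_members)
# Any non-empty set in `diff` flags an asymmetric LAG configuration.
```

An empty result on both sides means the group memberships agree; a non-empty set pinpoints exactly which ports were left out, and on which device.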
The adjacent figure shows a NetSpyGlass-generated network architecture map corresponding to an actual production network. The map shows a number of network nodes, consisting of core and distribution devices, and their interconnecting links. On the map, the width of each link is a visual indicator of its capacity (bandwidth). In this particular network, no device ports support speeds greater than 10G; link aggregation is therefore employed to create the high-capacity (i.e. 120G and 160G) interconnects shown. It is worth noting that this map was generated by NetSpyGlass's automated discovery feature: all SNMP-enabled devices in the network are added to the monitoring system via a plain-text configuration file, after which NetSpyGlass automatically determines device components, device state and network topology. Network visualization is similarly automated, in that no manual intervention is required to generate the map shown.
Now, if we focus attention on the upper left-hand corner of this map, we see two core routers connected by a very high-capacity aggregated link. However, the network operations team observed that this aggregated link never achieved expected levels of performance, particularly in terms of peak traffic-carrying capacity.
The usual configuration checks revealed no errors. Traffic passed successfully across the aggregated link, and apparent behavior was as expected apart from the aforementioned sub-par performance. Only after conducting an automated network discovery was the network operations team able to diagnose the problem visually. As it turns out, a number of discrete 10G links between these core routers had simply not been included in the LAG. Equally evident on the map was the high-capacity aggregated link between these core routers into which those discrete 10G links should have been added. This is perhaps the simplest example of quickly resolving a LAG misconfiguration via automated network discovery.

The automated network discovery capabilities that NetSpyGlass provides out of the box save network operations teams a tremendous amount of time, angst and frustration when diagnosing network performance issues that would otherwise be quite challenging to resolve. And the larger and more complex the network architecture, the more difficult these LAG misconfiguration problems become to diagnose and resolve without the right tooling.
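The symptom in this case comes down to simple arithmetic: a LAG's aggregate capacity is the sum of its member ports' speeds, so any parallel 10G link left out of the group shows up as a gap between the capacity the team expected and the capacity the LAG can actually deliver. A minimal sketch of that check, with hypothetical port names and speeds (not taken from the network in the figure):

```python
# Hypothetical discovery results: every physical link found between the
# two core routers, keyed by port name, with speed in Gbit/s.
discovered_links = {
    "xe-0/0/0": 10, "xe-0/0/1": 10, "xe-0/0/2": 10,
    "xe-0/0/3": 10, "xe-0/0/4": 10, "xe-0/0/5": 10,
}
# Ports actually configured as members of the LAG (illustrative).
lag_members = {"xe-0/0/0", "xe-0/0/1", "xe-0/0/2", "xe-0/0/3"}

# Actual LAG capacity: sum of member port speeds.
lag_capacity = sum(discovered_links[p] for p in lag_members)

# Parallel links discovered between the routers but missing from the LAG.
stray = {p: s for p, s in discovered_links.items() if p not in lag_members}
missing_capacity = sum(stray.values())
```

Here the LAG delivers 40G while 20G of discovered parallel links sit outside the group; the non-empty `stray` set is exactly the "discrete 10G links not included in the LAG" that the map made visually obvious.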
Again, the above represents the best-case LAG misconfiguration diagnosis and resolution scenario: NetSpyGlass's automated device discovery and visualization provided immediate visual cues that the misconfiguration was caused by a number of physical 10G ports not being added to the LAG on either side of the connection. In the next part of this two-part series, we take a detailed look at the more difficult-to-diagnose scenario, in which physical ports are added to a LAG on one side of a connection but not the other. As we will see, that configuration appears visually to be correct, yet the aggregated link's performance still fails to meet expectations. Rest assured, however, that with the right tooling even this more difficult kind of link aggregation misconfiguration can be readily diagnosed and resolved.