A cluster of Dell PowerEdge R7615 servers featuring AMD EPYC processors achieved much stronger performance on multi-GPU, multi-node operations using Broadcom 100GbE NICs than the same cluster using 10GbE NICs.
Conclusion
Many companies want to train LLMs on their internal data so they can solve a host of business problems. LLM training relies on low-level fundamental operations across distributed GPUs. When these operations perform efficiently, your LLM training takes much less time to complete, and your AI implementation can be operational sooner. Our tests measured the performance of these fundamental operations over distributed GPUs. We found that using Broadcom 100GbE BCM57508 NICs in a cluster of two Dell PowerEdge R7615 servers with AMD EPYC processors and NVIDIA GPUs provided dramatically lower latency and greater bandwidth than using only 10GbE networking, with no increase in power usage.
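The fundamental operations referenced above are collective communications such as all-reduce, which distributed training typically runs through GPU communication libraries (for example, NCCL on NVIDIA GPUs) and which traverse the cluster NICs. As an illustrative sketch only, not the benchmark code from these tests, the following pure-Python simulation shows the ring all-reduce pattern: each of n workers exchanges roughly 2(n-1)/n of its buffer with its neighbors, which is why NIC bandwidth and latency so strongly influence training time.

```python
def ring_allreduce(buffers):
    """Simulate ring all-reduce over n 'workers': every worker ends up
    holding the elementwise sum of all input buffers.

    buffers: list of n equal-length lists of numbers (buffer length must
    be divisible by n in this simplified sketch).
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "sketch assumes buffer length divisible by worker count"
    c = size // n
    # Split each worker's buffer into n chunks.
    chunks = [[buf[i * c:(i + 1) * c] for i in range(n)] for buf in buffers]

    # Phase 1, scatter-reduce: in each of n-1 steps, worker r sends one
    # chunk to its right neighbor and adds the chunk it receives from its
    # left neighbor into its own copy of that chunk.
    for step in range(n - 1):
        sends = [((r - step) % n, list(chunks[r][(r - step) % n]))
                 for r in range(n)]
        for r in range(n):
            idx, data = sends[(r - 1) % n]  # receive from left neighbor
            chunks[r][idx] = [a + b for a, b in zip(chunks[r][idx], data)]

    # Phase 2, all-gather: circulate the fully reduced chunks around the
    # ring so every worker ends with the complete summed buffer.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, list(chunks[r][(r + 1 - step) % n]))
                 for r in range(n)]
        for r in range(n):
            idx, data = sends[(r - 1) % n]
            chunks[r][idx] = data

    return [[x for ch in chunks[r] for x in ch] for r in range(n)]

# Example: two workers, each contributing a 2-element gradient buffer.
print(ring_allreduce([[1, 2], [3, 4]]))  # both workers end with [4, 6]
```

In real multi-node training, each simulated "send" above is a network transfer between GPUs on different servers, so the per-step latency and sustained bandwidth of the NICs directly bound how quickly every gradient synchronization completes.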