In addition to sinking billions into OpenAI, Microsoft spent hundreds of millions on hardware, much of it from Nvidia.

Microsoft’s investment in ChatGPT involves not just the money sunk into its maker, OpenAI, but also a massive hardware investment in data centers, which suggests that, for now, AI solutions are only for the very top-tier companies.

The partnership between Microsoft and OpenAI dates back to 2019, when Microsoft invested $1 billion in the AI developer. It upped the ante in January with an additional $10 billion investment.

But ChatGPT has to run on something, and that is Azure hardware in Microsoft data centers. How much has not been disclosed, but according to a report by Bloomberg, Microsoft had already spent “several hundred million dollars” on hardware used to train ChatGPT.

In a pair of blog posts, Microsoft detailed what went into building the AI infrastructure to run ChatGPT as part of the Bing service. It already offered virtual machines for AI processing built on Nvidia’s A100 GPU, called ND A100 v4. Now it is introducing the ND H100 v5 VM, based on newer hardware, in sizes ranging from eight to thousands of Nvidia H100 GPUs.

In his blog post, Matt Vegas, principal product manager of Azure HPC+AI, wrote that customers will see significantly faster performance for AI models compared with the ND A100 v4 VMs. The new VMs are powered by Nvidia H100 Tensor Core GPUs (“Hopper” generation) interconnected via next-gen NVSwitch and NVLink 4.0, Nvidia’s 400 Gb/s Quantum-2 CX7 InfiniBand networking, and 4th Gen Intel Xeon Scalable processors (“Sapphire Rapids”) with PCIe Gen5 interconnects and DDR5 memory.

Vegas did not say just how much hardware is involved, but he did say that Microsoft is delivering multiple exaFLOPs of supercomputing power to Azure customers. There is only one exaFLOP supercomputer that we know of, as reported by the latest TOP500 semiannual list of the world’s fastest: Frontier at Oak Ridge National Laboratory.
But that’s the thing about the TOP500: not everyone reports their supercomputers, so there may be other systems out there just as powerful as Frontier that we just don’t know about.

In a separate blog post, Microsoft talked about how the company started working with OpenAI to help create the supercomputers needed for ChatGPT’s large language model (and for Microsoft’s own Bing Chat). That meant linking thousands of GPUs together in a new way that even Nvidia hadn’t thought of, according to Nidhi Chappell, Microsoft head of product for Azure high-performance computing and AI.

“This is not something that you just buy a whole bunch of GPUs, hook them together, and they’ll start working together. There is a lot of system-level optimization to get the best performance, and that comes with a lot of experience over many generations,” Chappell said.

To train a large language model, the workload is partitioned across thousands of GPUs in a cluster, and at certain steps in the process the GPUs exchange information on the work they’ve done. An InfiniBand network pushes the data around at high speed, since the validation step must be completed before the GPUs can start the next step of processing.

The Azure infrastructure is optimized for large language model training, but it took years of incremental improvements to its AI platform to get there. The combination of GPUs, networking hardware and virtualization software required to deliver Bing AI is immense and is spread out across 60 Azure regions around the world.

ND H100 v5 instances are available for preview and will become a standard offering in the Azure portfolio, but Microsoft has not said when. Interested parties can request access to the new VMs.
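For readers who want a feel for the training pattern described above, in which work is split across GPUs that must exchange results before moving on, here is a minimal Python sketch. It is an illustration under stated assumptions, not Microsoft’s or OpenAI’s method: the article does not say which synchronization scheme is used, so the sketch assumes one common approach, averaging gradients across workers at every step. Each “worker” is a plain Python stand-in for a GPU, and the function names (`local_gradient`, `all_reduce_mean`, `train`) are hypothetical.

```python
# Toy illustration only: each "worker" stands in for one GPU, and plain Python
# lists stand in for GPU memory. No real GPU or InfiniBand API is used.

def local_gradient(weight, shard):
    """Gradient of mean squared error for the model y = weight * x on one shard."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for the exchange step: every worker ends up with the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def train(data, num_workers=4, steps=50, lr=0.01):
    # Partition the workload: worker i gets every num_workers-th sample.
    shards = [data[i::num_workers] for i in range(num_workers)]
    weights = [0.0] * num_workers  # each worker holds its own copy of the model
    for _ in range(steps):
        # Each worker computes a gradient on its own shard only.
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        # Synchronization point: no worker proceeds until the exchange is done.
        synced = all_reduce_mean(grads)
        weights = [w - lr * g for w, g in zip(weights, synced)]
    return weights

data = [(x, 3.0 * x) for x in range(1, 9)]  # samples drawn from y = 3x
final_weights = train(data)
```

Because every worker applies the same averaged gradient, all copies of the model stay identical from step to step. In a real cluster that exchange runs over the network, so a slow interconnect would leave every GPU waiting at the synchronization point, which is why the speed of the link between GPUs matters so much.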