Dive Brief:
- Amid an industrywide race to deploy AI-optimized hardware, Google Cloud unveiled the fifth generation of its tensor processing unit chips Tuesday. The Cloud TPU v5e chips are now available in preview.
- The new TPUs are designed to affordably support midsize and large-scale AI training and inference workloads, according to the company. “This is the most cost efficient and accessible cloud TPU to date,” Mark Lohmeyer, VP and GM for compute and ML infrastructure at Google Cloud, said Friday during a virtual press conference.
- To support more compute-intensive model training, the company said it will deploy the new A3 virtual machines, servers built with Nvidia H100 GPUs, next month.
Dive Insight:
Compute costs are a looming practical concern hanging over generative AI use cases.
The growing marketplace of available models rests atop racks of cloud servers running on high-powered, costly processors. Expensive infrastructure overhauls are underway, with the three biggest hyperscalers committed to expanding capacity.
Demand for graphics processing units has created a shortage, Sreekar Krishna, principal, partner and head of AI at consulting firm KPMG, told CIO Dive.
“There’s a GPU war going on between these big hyperscalers, but it’s only temporary,” Krishna said.
While graphics processing units are the gold standard for training large language models, there are alternatives for less taxing workloads. AWS and Google Cloud have developed more efficient proprietary chip technologies and Microsoft is reportedly working to do the same.
Google's new-generation TPU can handle inference workloads, which generate AI outputs, at less than half the cost of its predecessor, the announcement said. The processors will be available via Google Kubernetes Engine and as a managed service through the hyperscaler's Vertex AI platform.
The new TPUs deliver twice the training performance per dollar and two and a half times the inference performance per dollar of the prior generation of the chips, Lohmeyer said Friday.
A new Multislice feature, used to power Google’s PaLM models, gives customers the option of scaling compute up or down based on workload requirements, the announcement said.
“We're enabling our customers to easily scale their AI models beyond the physical boundaries of a single TPU pod or a single TPU cluster,” said Lohmeyer. “A single large AI workload can now span multiple physical TPU clusters, scaling to literally tens of thousands of chips, and doing so very cost effectively.”
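Multislice itself is provisioned at the infrastructure level, but the programming model it extends is visible in frameworks such as JAX, Google's preferred stack for TPU workloads. The sketch below is a minimal illustration, not Google's Multislice API: it maps one toy computation across however many chips the runtime exposes, the same data-parallel pattern Multislice stretches across multiple TPU pods. The device count and the matmul step are illustrative assumptions.

```python
# Minimal JAX data-parallel sketch (illustrative, not the Multislice API).
# Assumes a JAX runtime; on a TPU VM, jax.devices() lists every attached chip.
import jax
import jax.numpy as jnp

n = len(jax.devices())  # number of chips visible to this host

# pmap replicates the function and runs one shard of the batch per chip.
@jax.pmap
def shard_step(x):
    return jnp.dot(x, x.T)  # stand-in for one step of real work

batch = jnp.ones((n, 128, 128))  # leading axis must equal the device count
out = shard_step(batch)
print(out.shape)  # (n, 128, 128): one result per chip
```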
For the most demanding enterprise workloads, the new Nvidia-powered A3 VMs provide three times faster training and ten times greater networking bandwidth than the previous-generation A2 VMs, deployed three years ago, the company said.
Not all use cases will require that level of compute, Krishna said. In time, processing costs are expected to level out.
“As time evolves, we're going to have more specialized hardware that does deep learning better,” said Krishna, and the specific processor will be less important than the end results. “Do you care what processor is inside your iPhone or Apple Watch? I don't, because it does the job of a watch much better than my old watch ever did.”