
Exploring Google Cloud Networking Improvements for Generative AI Applications

Many enterprises are exploring how to integrate generative AI (gen AI) into their operations. According to the 2023 Gartner® report “We Shape AI, AI Shapes Us: 2023 IT Symposium/Xpo Keynote Insights” from October 16, 2023, most organizations are either using or planning to use everyday AI to enhance productivity. In the 2024 Gartner CIO and Technology Executive Survey, 80% of respondents indicated plans to adopt generative AI within the next three years.

Enterprises aiming to deploy large language models (LLMs) encounter networking challenges that differ from those of traditional web applications. Typical web applications have predictable traffic patterns, with requests and responses processed in milliseconds. Gen AI inference applications, by contrast, have highly variable request and response sizes and processing times because of their multimodal nature. A single LLM query can also consume 100% of a GPU’s or TPU’s compute time, whereas typical web requests are processed in parallel across shared resources; this high computational cost results in inference latencies ranging from seconds to minutes.

‘Typical’ web traffic:

  • Small request/response size

  • Many queries can be parallelized 

  • Process requests as soon as they arrive

  • Processing time is in ms

  • Similar requests can be served from cache 

  • Request cost managed within a backend

Gen AI traffic:

  • Large requests/responses due to multimodal traffic

  • Single LLM query takes 100% TPU/GPU compute time

  • Requests wait for available compute

  • Variable processing times from seconds to minutes

  • Requests often generate unique content 

  • Traffic routed to a cheaper or more expensive model depending on the request

Therefore, conventional round-robin or utilization-based traffic management methods are typically unsuitable for gen AI applications. To ensure optimal end-user experiences and efficient use of scarce, expensive GPU and TPU resources, we have recently introduced a range of new networking features designed specifically for AI applications.

These innovations are integrated into Vertex AI and are also accessible directly through Cloud Networking, so you can use them with the LLM platform of your choice.

Let’s delve into these enhancements.

1. Enhanced AI training and inference acceleration through Cross-Cloud Networking

According to an IDC report, 66% of enterprises prioritize generative AI and AI/ML workloads as key use cases for leveraging multi-cloud networking. This is due to the dispersed nature of data required for tasks such as model training, fine-tuning, retrieval-augmented generation (RAG), or grounding, which resides across various environments and needs remote access or replication to be available for LLM models.

Last year, we introduced Cross-Cloud Network, a service-oriented connectivity solution that facilitates seamless connectivity across clouds using Google’s global network infrastructure. This enables easier development and integration of distributed applications across multiple cloud environments.

Cross-Cloud Network encompasses several products designed to ensure secure, reliable, and SLA-backed connectivity for high-speed data transfers between clouds, crucial for managing the substantial data volumes necessary for gen AI model training. Among these products is Cross-Cloud Interconnect, which provides managed interconnectivity offering bandwidth options of 10 Gbps or 100 Gbps, supported by a 99.99% SLA and end-to-end encryption. Beyond facilitating secure data transfers for AI training, Cross-Cloud Network allows customers to deploy AI model inference applications across hybrid environments. For instance, it enables accessing models hosted in Google Cloud from application services operating in other cloud environments.

2. Model as a Service Endpoint: a specialized solution tailored for AI applications

The Model as a Service Endpoint addresses the specific needs of AI inference applications. Given the specialized nature of generative AI, many organizations offer models as a service for consumption by application development teams. This solution is designed specifically to support such scenarios.

The Model as a Service Endpoint embodies an architectural best practice consisting of three primary Google Cloud components:

1. App Hub, recently launched and now generally available, serves as a centralized repository for tracking applications, services, and workloads across Cloud projects. It maintains comprehensive records of services to enhance their discoverability and reusability, encompassing AI applications and models.

2. Private Service Connect (PSC) ensures secure connectivity to AI models. This feature enables model producers to establish a PSC service attachment, allowing model consumers to securely access generative AI models for inference. Model producers can define access policies, controlling which entities can connect to their gen AI models. PSC also simplifies network access across different environments, even for consumers outside of Google Cloud (a consumer-side sketch follows this list).

3. Cloud Load Balancing incorporates various innovations tailored for efficiently directing traffic to large language models (LLMs). This includes an advanced AI-aware Cloud Load Balancing capability that optimizes traffic distribution to models. These functionalities are beneficial for both model producers and AI application developers, detailed further in subsequent sections of this blog.
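As a simple illustration of the consumer side described in item 2, the sketch below sends an inference request to a producer’s model through a Private Service Connect endpoint. The endpoint DNS name, request path, and payload shape are hypothetical placeholders for whatever the model producer actually publishes.

```python
import requests

# Hypothetical DNS name for a Private Service Connect endpoint that points to
# the producer's service attachment; traffic stays on private connectivity
# rather than traversing the public internet.
PSC_ENDPOINT = "https://llm.psc.internal.example"

def generate(prompt: str) -> str:
    """Send one inference request through the PSC endpoint."""
    resp = requests.post(
        f"{PSC_ENDPOINT}/v1/generate",
        json={"prompt": prompt},
        timeout=300,  # gen AI inference can take seconds to minutes
    )
    resp.raise_for_status()
    return resp.json()["completion"]
```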

3. Reduced inference latency using customized AI-aware load balancing

Many large language model (LLM) applications use platform-specific queues to receive and process user prompts. Keeping end-user response times consistent requires keeping these queues of pending prompts as shallow as possible, which means requests should be directed to LLM backends based on their respective queue depths.

To support this, Cloud Load Balancing is adding traffic distribution based on custom metrics, coming later this year. Custom metrics let you use application-specific signals, such as queue depth, to influence both traffic routing and the scaling of backend resources. Backends report these metrics to Cloud Load Balancing using the Open Request Cost Aggregation (ORCA) standard in response headers.

For gen AI applications, configuring queue depth as a custom metric enables Cloud Load Balancing to distribute traffic so that queue depths stay minimal across backends, significantly reducing both average and peak latency during inference serving. In practical demonstrations, using queue depth as the key metric for traffic distribution has yielded latency improvements of 5-10x for AI applications.
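To make this concrete, here is a minimal sketch of a model-serving backend that tracks its own queue depth and reports it in an ORCA-style load report on each response. The response-header name (`endpoint-load-metrics`), its value format, and the metric key (`named_metrics.queue_depth`) are assumptions here; consult the Cloud Load Balancing custom-metrics documentation for the precise ORCA reporting format.

```python
import threading

from flask import Flask, jsonify, request

app = Flask(__name__)

# Number of prompts currently queued or being processed by this replica.
queue_depth = 0
lock = threading.Lock()

@app.route("/v1/generate", methods=["POST"])
def generate():
    global queue_depth
    with lock:
        queue_depth += 1
    try:
        prompt = (request.get_json(silent=True) or {}).get("prompt", "")
        completion = f"echo: {prompt}"  # placeholder for the real (slow) LLM call
    finally:
        with lock:
            queue_depth -= 1

    response = jsonify({"completion": completion})
    # Report the current queue depth as an ORCA named metric so the load
    # balancer can steer new requests toward less-loaded backends.
    # The header name and value format below are assumptions; check the docs.
    response.headers["endpoint-load-metrics"] = (
        f"TEXT named_metrics.queue_depth={queue_depth}"
    )
    return response

if __name__ == "__main__":
    app.run(port=8080)
```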

 

4. Enhanced traffic routing for AI inference applications

Improving Inference Reliability:

  • Internal Application Load Balancer with Cloud Health Checks: Ensures high availability of model service endpoints by monitoring the health of individual model instances and routing requests only to healthy models (a minimal health-endpoint sketch follows this list).
  • Global Load Balancing with Health Checks: Routes traffic to the nearest healthy model service endpoint within Google Cloud regions, optimizing latency for model consumers. This capability can also extend to multi-cloud or on-premises deployments.
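To illustrate how a model backend can participate in health checking, here is a minimal sketch of a serving process exposing a health endpoint that a load balancer health check could probe. The path, port, and “overloaded” condition are assumptions for illustration only.

```python
from flask import Flask

app = Flask(__name__)

# In a real serving process these would reflect actual state: model weights
# loaded, accelerator reachable, request queue not saturated.
model_ready = True
queue_depth = 0
MAX_QUEUE_DEPTH = 32

@app.route("/health")
def health():
    """Health-check endpoint for the load balancer to probe.

    A non-200 response takes this replica out of rotation, so prompts are
    routed only to healthy model instances.
    """
    if model_ready and queue_depth < MAX_QUEUE_DEPTH:
        return "ok", 200
    return "unavailable", 503

if __name__ == "__main__":
    app.run(port=8080)
```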

Enhancing Model Efficacy:

  • Google Cloud Load Balancing Weighted Traffic Splitting: Diverts portions of traffic to different models or model versions, enabling A/B testing and progressive rollout of new versions with blue/green deployments (see the sketch below).
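The following is a minimal, purely illustrative Python sketch of what weighted splitting does. In practice the weights are configured on the load balancer (for example in its URL map), not in application code, and the model names below are hypothetical.

```python
import random
from collections import Counter

# Hypothetical model versions and their traffic weights: a blue/green rollout
# that keeps 90% of prompts on the stable model and sends 10% to the candidate.
WEIGHTED_BACKENDS = {"llm-stable": 90, "llm-candidate": 10}

def pick_backend() -> str:
    """Choose a backend for one request according to the configured weights."""
    backends = list(WEIGHTED_BACKENDS)
    weights = list(WEIGHTED_BACKENDS.values())
    return random.choices(backends, weights=weights, k=1)[0]

# Simulate 1,000 requests and show the resulting split (roughly 900/100).
print(Counter(pick_backend() for _ in range(1000)))
```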

Load Balancing for Streaming (coming to Cloud Load Balancing later this year):

  • Designed for gen AI requests with long or variable durations (longer than 10 seconds, such as those involving image processing). It distributes traffic based on each backend’s capacity to handle concurrent streams, ensuring efficient resource utilization and a better user experience for longer-running requests. The sketch below illustrates the underlying idea.
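Here is a minimal, purely illustrative Python sketch of that idea: each new streaming request goes to the backend currently serving the fewest concurrent streams. This is not the Cloud Load Balancing implementation, and the backend names are hypothetical.

```python
import threading
from collections import defaultdict

# Number of in-flight streams per (hypothetical) backend.
active_streams = defaultdict(int)
lock = threading.Lock()
BACKENDS = ["model-backend-a", "model-backend-b", "model-backend-c"]

def acquire_backend() -> str:
    """Pick the backend with the fewest concurrent streams and reserve a slot."""
    with lock:
        backend = min(BACKENDS, key=lambda b: active_streams[b])
        active_streams[backend] += 1
    return backend

def release_backend(backend: str) -> None:
    """Call when the streamed response finishes, which for gen AI
    can be seconds to minutes after the request was routed."""
    with lock:
        active_streams[backend] -= 1
```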

These capabilities collectively support reliable and efficient operation of gen AI applications across diverse deployment scenarios, including multi-cloud environments and varying latency requirements.

5. Improve gen AI deployment using Service Extensions

We are pleased to announce that Service Extensions for Google Cloud Application Load Balancers are now generally available, with availability planned for Cloud Service Mesh later this year. Service Extensions enable the integration of SaaS solutions or customizable actions within the data path, such as implementing custom logging or transforming headers.

Service Extensions offer several benefits for gen AI applications, enhancing the user experience in various ways. For instance, they can facilitate prompt blocking to prevent unwanted prompts from reaching backend models, thereby conserving valuable GPU and TPU processing resources. Additionally, Service Extensions enable routing requests to specific backend models based on criteria analyzed from request headers, ensuring that each prompt is handled by the most suitable model for optimal performance.
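As a rough illustration of the kind of logic a Service Extension can apply in the request path, here is a sketch of prompt blocking and header-based model selection written as plain Python functions. A real extension runs as a callout service invoked by the load balancer rather than inline code like this, and the deny-list terms, header name, and model names below are purely hypothetical.

```python
# Hypothetical deny list: prompts containing these phrases never reach a model,
# saving GPU/TPU processing time.
BLOCKED_TERMS = {"ignore previous instructions", "reveal the system prompt"}

# Hypothetical mapping from a client-supplied header value to a backend model.
MODEL_BY_TIER = {"premium": "llm-large", "standard": "llm-small"}

def should_block(prompt: str) -> bool:
    """Return True if the prompt should be rejected before inference."""
    text = prompt.lower()
    return any(term in text for term in BLOCKED_TERMS)

def select_backend(headers: dict[str, str]) -> str:
    """Pick a cheaper or more capable model based on a request header."""
    tier = headers.get("x-request-tier", "standard")
    return MODEL_BY_TIER.get(tier, MODEL_BY_TIER["standard"])
```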
