Reduce costs and latency with Amazon Bedrock Intelligent Prompt Routing and prompt caching (preview)




December 5, 2024: Added instructions to request access to the Amazon Bedrock prompt caching preview.

Today, Amazon Bedrock has introduced in preview two capabilities that help reduce costs and latency for generative AI applications:

Amazon Bedrock Intelligent Prompt Routing – When invoking a model, you can now use a combination of foundation models (FMs) from the same model family to help optimize for quality and cost. For example, with Anthropic’s Claude model family, Amazon Bedrock can intelligently route requests between Claude 3.5 Sonnet and Claude 3 Haiku depending on the complexity of the prompt. Similarly, Amazon Bedrock can route requests between Meta Llama 3.1 70B and 8B. The prompt router predicts which model will provide the best performance for each request while optimizing the quality of the response and the cost. This is particularly useful for applications such as customer service assistants, where uncomplicated queries can be handled by smaller, faster, and more cost-effective models, and complex queries are routed to more capable models. Intelligent Prompt Routing can reduce costs by up to 30 percent without compromising on accuracy.

Amazon Bedrock now supports prompt caching – You can now cache frequently used context in prompts across multiple model invocations. This is especially valuable for applications that repeatedly use the same context, such as document Q&A systems where users ask multiple questions about the same document, or coding assistants that need to maintain context about code files. The cached context remains available for up to 5 minutes after each access. Prompt caching in Amazon Bedrock can reduce costs by up to 90% and latency by up to 85% for supported models.

These features make it easier to reduce latency and balance performance with cost efficiency. Let’s look at how you can use them in your applications.

Using Amazon Bedrock Intelligent Prompt Routing in the console
Amazon Bedrock Intelligent Prompt Routing uses advanced prompt matching and model understanding techniques to predict the performance of each model for every request, optimizing for quality of responses and cost. During the preview, you can use the default prompt routers for the Anthropic Claude and Meta Llama model families.

Intelligent prompt routing can be accessed through the AWS Management Console, the AWS Command Line Interface (AWS CLI), and the AWS SDKs. In the Amazon Bedrock console, I choose Prompt routers in the Foundation models section of the navigation pane.

Console screenshot.

I choose the Anthropic Prompt Router default router to get more information.

Console screenshot.

From the configuration of the prompt router, I see that it’s routing requests between Claude 3.5 Sonnet and Claude 3 Haiku using cross-Region inference profiles. The routing criteria defines the quality difference between the response of the largest model and the smallest model for each prompt, as predicted by the router’s internal model at runtime. The fallback model, used when none of the chosen models meets the desired performance criteria, is Anthropic’s Claude 3.5 Sonnet.

I choose Open in Playground to chat using the prompt router and enter this prompt:

Alice has N brothers and she also has M sisters. How many sisters does Alice’s brothers have?

The result is quickly provided. I choose the new Router metrics icon on the right to see which model was selected by the prompt router. In this case, because the question is rather complex, Anthropic’s Claude 3.5 Sonnet was used.

Console screenshot.

Now I ask a straightforward question to the same prompt router:

Describe the purpose of a 'hello world' program in one line.

This time, Anthropic’s Claude 3 Haiku has been selected by the prompt router.

Console screenshot.

I select the Meta Prompt Router to check its configuration. It’s using the cross-Region inference profiles for Llama 3.1 70B and 8B, with the 70B model as the fallback.

Console screenshot.

Prompt routers are integrated with other Amazon Bedrock capabilities, such as Amazon Bedrock Knowledge Bases and Amazon Bedrock Agents, or when performing evaluations. For example, here I create a model evaluation to compare, for my use case, a prompt router to another model or prompt router.

Console screenshot.

To use a prompt router in an application, I need to set the prompt router Amazon Resource Name (ARN) as the model ID in the Amazon Bedrock API. Let’s see how this works with the AWS CLI and an AWS SDK.

Using Amazon Bedrock Intelligent Prompt Routing with the AWS CLI
The Amazon Bedrock API has been extended to handle prompt routers. For example, I can list the existing prompt routers in an AWS Region using ListPromptRouters:

aws bedrock list-prompt-routers

In the output, I receive a summary of the existing prompt routers, similar to what I saw in the console.

Here’s the full output of the previous command:

{
    "promptRouterSummaries": [
        {
            "promptRouterName": "Anthropic Prompt Router",
            "routingCriteria": {
                "responseQualityDifference": 0.26
            },
            "description": "Routes requests among models in the Claude family",
            "createdAt": "2024-11-20T00:00:00+00:00",
            "updatedAt": "2024-11-20T00:00:00+00:00",
            "promptRouterArn": "arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/anthropic.claude:1",
            "models": [
                {
                    "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.anthropic.claude-3-haiku-20240307-v1:0"
                },
                {
                    "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.anthropic.claude-3-5-sonnet-20240620-v1:0"
                }
            ],
            "fallbackModel": {
                "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.anthropic.claude-3-5-sonnet-20240620-v1:0"
            },
            "status": "AVAILABLE",
            "type": "default"
        },
        {
            "promptRouterName": "Meta Prompt Router",
            "routingCriteria": {
                "responseQualityDifference": 0.0
            },
            "description": "Routes requests among models in the LLaMA family",
            "createdAt": "2024-11-20T00:00:00+00:00",
            "updatedAt": "2024-11-20T00:00:00+00:00",
            "promptRouterArn": "arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/meta.llama:1",
            "models": [
                {
                    "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-8b-instruct-v1:0"
                },
                {
                    "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-70b-instruct-v1:0"
                }
            ],
            "fallbackModel": {
                "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-70b-instruct-v1:0"
            },
            "status": "AVAILABLE",
            "type": "default"
        }
    ]
}

I can get information about a specific prompt router using GetPromptRouter with a prompt router ARN. For example, for the Meta Llama model family:

aws bedrock get-prompt-router --prompt-router-arn arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/meta.llama:1

{
    "promptRouterName": "Meta Prompt Router",
    "routingCriteria": {
        "responseQualityDifference": 0.0
    },
    "description": "Routes requests among models in the LLaMA family",
    "createdAt": "2024-11-20T00:00:00+00:00",
    "updatedAt": "2024-11-20T00:00:00+00:00",
    "promptRouterArn": "arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/meta.llama:1",
    "models": [
        {
            "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-8b-instruct-v1:0"
        },
        {
            "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-70b-instruct-v1:0"
        }
    ],
    "fallbackModel": {
        "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-70b-instruct-v1:0"
    },
    "status": "AVAILABLE",
    "type": "default"
}
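
The same information is available from the SDKs. For example, this short boto3 sketch (a minimal example, assuming a boto3 version that includes the prompt router operations and default AWS credentials) lists the routers and then describes one of them:

import boto3

# The prompt router operations are on the Amazon Bedrock control plane client.
bedrock = boto3.client("bedrock", region_name="us-east-1")

# Same information as the ListPromptRouters CLI command above.
for router in bedrock.list_prompt_routers()["promptRouterSummaries"]:
    print(router["promptRouterName"], router["promptRouterArn"])

# Same information as the GetPromptRouter CLI command above.
details = bedrock.get_prompt_router(
    promptRouterArn="arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/meta.llama:1"
)
print(details["fallbackModel"]["modelArn"])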

To use a prompt router with Amazon Bedrock, I set the prompt router ARN as the model ID when making API calls. For example, here I use the Anthropic Prompt Router with the AWS CLI and the Amazon Bedrock Converse API:

aws bedrock-runtime converse \
    --model-id arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/anthropic.claude:1 \
    --messages '[{ "role": "user", "content": [ { "text": "Alice has N brothers and she also has M sisters. How many sisters does Alice’s brothers have?" } ] }]'

In the output, invocations using a prompt router include a new trace section that tells which model was actually used. In this case, it’s Anthropic’s Claude 3.5 Sonnet:

{
    "output": {
        "message": {
            "role": "assistant",
            "content": [
                {
                    "text": "To solve this problem, let's think it through step-by-step:\n\n1) First, we need to understand the relationships:\n   - Alice has N brothers\n   - Alice has M sisters\n\n2) Now, we need to consider who Alice's brothers' sisters are:\n   - Alice herself is a sister to all her brothers\n   - All of Alice's sisters are also sisters to Alice's brothers\n\n3) So, the total number of sisters that Alice's brothers have is:\n   - The number of Alice's sisters (M)\n   - Plus Alice herself (+1)\n\n4) Therefore, the answer can be expressed as: M + 1\n\nThus, Alice's brothers have M + 1 sisters."
                }
            ]
        }
    },
    . . .
    "trace": {
        "promptRouter": {
            "invokedModelId": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.anthropic.claude-3-5-sonnet-20240620-v1:0"
        }
    }
}

Using Amazon Bedrock Intelligent Prompt Routing with an AWS SDK
Using an AWS SDK with a prompt router is similar to the previous command line experience. When invoking a model, I set the model ID to the prompt router ARN. For example, in this Python code I’m using the Meta Llama router with the ConverseStream API:

import json
import boto3

bedrock_runtime = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1",
)

MODEL_ID = "arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/meta.llama:1"

user_message = "Describe the purpose of a 'hello world' program in one line."
messages = [
    {
        "role": "user",
        "content": [{"text": user_message}],
    }
]

streaming_response = bedrock_runtime.converse_stream(
    modelId=MODEL_ID,
    messages=messages,
)

for chunk in streaming_response["stream"]:
    if "contentBlockDelta" in chunk:
        text = chunk["contentBlockDelta"]["delta"]["text"]
        print(text, end="")
    if "messageStop" in chunk:
        print()
    if "metadata" in chunk:
        if "hint" in chunk["metadata"]:
            print(json.dumps(chunk['metadata']['trace'], indent=2))

This script prints the response text and the content of the trace in the response metadata. For this uncomplicated request, the faster and more affordable model was selected by the prompt router:

A "Good day World" program is an easy, introductory program that serves as a primary instance to exhibit the elemental syntax and performance of a programming language, usually used to confirm {that a} growth atmosphere is about up accurately.
{
  "promptRouter": {
    "invokedModelId": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-8b-instruct-v1:0"
  }
}

Using prompt caching with an AWS SDK
You can use prompt caching with the Amazon Bedrock Converse API. When you tag content for caching and send it to the model for the first time, the model processes the input and saves the intermediate results in a cache. For subsequent requests containing the same content, the model loads the preprocessed results from the cache, significantly reducing both costs and latency.

You can implement prompt caching in your applications with a few steps:

  1. Identify the portions of your prompts that are frequently reused.
  2. Tag these sections for caching in the list of messages using the new cachePoint block (see the minimal sketch after this list).
  3. Monitor cache usage and latency improvements in the usage section of the response metadata.
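
Structurally, a cachePoint is just another content block in a message: the content that precedes it becomes the reusable prefix. Here’s a minimal sketch of the shape (pdf_bytes and the texts are placeholders):

# A user message where everything before the cachePoint block is cached.
message = {
    "role": "user",
    "content": [
        {"document": {"name": "guide", "format": "pdf", "source": {"bytes": pdf_bytes}}},
        {"text": "Answer questions about this document."},
        {"cachePoint": {"type": "default"}},  # content above this marker is cached
    ],
}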

Here’s a fuller example of implementing prompt caching when working with documents.

First, I download three decision guides in PDF format from the AWS website. These guides help choose the AWS services that fit your use case.

Then, I use a Python script to ask three questions about the documents. In the code, I create a converse() function to handle the conversation with the model. The first time I call the function, I include a list of documents and a flag to add a cachePoint block.

import json

import boto3

MODEL_ID = "us.anthropic.claude-3-5-sonnet-20241022-v2:0"
AWS_REGION = "us-west-2"

bedrock_runtime = boto3.client(
    "bedrock-runtime",
    region_name=AWS_REGION,
)

DOCS = [
    "bedrock-or-sagemaker.pdf",
    "generative-ai-on-aws-how-to-choose.pdf",
    "machine-learning-on-aws-how-to-choose.pdf",
]

messages = []


def converse(new_message, docs=[], cache=False):

    if len(messages) == 0 or messages[-1]["role"] != "user":
        messages.append({"role": "user", "content": []})

    for doc in docs:
        print(f"Adding document: {doc}")
        name, format = doc.rsplit('.', maxsplit=1)
        with open(doc, "rb") as f:
            bytes = f.read()
        messages[-1]["content"].append({
            "document": {
                "name": name,
                "format": format,
                "source": {"bytes": bytes},
            }
        })

    messages[-1]["content"].append({"text": new_message})

    if cache:
        messages[-1]["content"].append({"cachePoint": {"type": "default"}})

    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=messages,
    )

    output_message = response["output"]["message"]
    response_text = output_message["content"][0]["text"]

    print("Response textual content:")
    print(response_text)

    print("Utilization:")
    print(json.dumps(response["usage"], indent=2))

    messages.append(output_message)


converse("Examine AWS Trainium and AWS Inferentia in 20 phrases or much less.", docs=DOCS, cache=True)
converse("Examine Amazon Textract and Amazon Transcribe in 20 phrases or much less.")
converse("Examine Amazon Q Enterprise and Amazon Q Developer in 20 phrases or much less.")

For each invocation, the script prints the response and the usage counters.

Adding document: bedrock-or-sagemaker.pdf
Adding document: generative-ai-on-aws-how-to-choose.pdf
Adding document: machine-learning-on-aws-how-to-choose.pdf
Response text:
AWS Trainium is optimized for machine learning training, while AWS Inferentia is designed for low-cost, high-performance machine learning inference.
Usage:
{
  "inputTokens": 4,
  "outputTokens": 34,
  "totalTokens": 29879,
  "cacheReadInputTokenCount": 0,
  "cacheWriteInputTokenCount": 29841
}
Response text:
Amazon Textract extracts text and data from documents, while Amazon Transcribe converts speech to text from audio or video files.
Usage:
{
  "inputTokens": 59,
  "outputTokens": 30,
  "totalTokens": 29930,
  "cacheReadInputTokenCount": 29841,
  "cacheWriteInputTokenCount": 0
}
Response text:
Amazon Q Business answers questions using enterprise data, while Amazon Q Developer assists with building and operating AWS applications and services.
Usage:
{
  "inputTokens": 108,
  "outputTokens": 26,
  "totalTokens": 29975,
  "cacheReadInputTokenCount": 29841,
  "cacheWriteInputTokenCount": 0
}

The usage section of the response contains two new counters: cacheReadInputTokenCount and cacheWriteInputTokenCount. The total number of tokens for an invocation is the sum of the input and output tokens plus the tokens read from and written to the cache.
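
For the first invocation above, that’s 4 input tokens + 34 output tokens + 0 cache reads + 29,841 cache writes = 29,879 tokens, matching the reported totalTokens.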

Each invocation processes a list of messages. The messages in the first invocation contain the documents, the first question, and the cache point. Because the messages preceding the cache point aren’t currently in the cache, they’re written to the cache. According to the usage counters, 29,841 tokens were written into the cache.

"cacheWriteInputTokenCount": 29841

For the subsequent invocations, the previous response and the new question are appended to the list of messages. The messages before the cachePoint aren’t changed and are found in the cache.

As expected, we can tell from the usage counters that the same number of tokens previously written is now read from the cache.

"cacheReadInputTokenCount": 29841

In my tests, the subsequent invocations take 55 percent less time to complete compared to the first one. Depending on your use case (for example, with more cached content), prompt caching can improve latency by up to 85 percent.

Depending on the model, you can set more than one cache point in a list of messages. To find the right cache points for your use case, try different configurations and look at the effect on the reported usage.
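
For instance, with a model that supports more than one cache point, you could cache a stable set of instructions and a large document as separate segments, so that changing one doesn’t invalidate the other. A hypothetical layout (the block contents are placeholders, and this assumes the model allows two cache points in the same message):

content = [
    {"text": "You are an assistant that answers questions about AWS guides."},
    {"cachePoint": {"type": "default"}},  # first cached segment: the instructions
    {"document": {"name": "guide", "format": "pdf", "source": {"bytes": pdf_bytes}}},
    {"cachePoint": {"type": "default"}},  # second cached segment: the document
    {"text": "What does the guide say about inference costs?"},  # varies per request
]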

Things to know
Amazon Bedrock Intelligent Prompt Routing is available in preview today in the US East (N. Virginia) and US West (Oregon) AWS Regions. During the preview, you can use the default prompt routers, and there’s no additional cost for using a prompt router. You pay the cost of the selected model. You can use prompt routers with other Amazon Bedrock capabilities such as performing evaluations, using knowledge bases, and configuring agents.

Because the internal model used by the prompt routers needs to understand the complexity of a prompt, intelligent prompt routing currently supports only English language prompts.

Amazon Bedrock support for prompt caching is available in preview in US West (Oregon) for Anthropic’s Claude 3.5 Sonnet V2 and Claude 3.5 Haiku. Prompt caching is also available in US East (N. Virginia) for Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro. You can request access to the Amazon Bedrock prompt caching preview here.

With prompt caching, cache reads receive a 90 percent discount compared to noncached input tokens. There are no additional infrastructure costs for cache storage. When using Anthropic models, you pay an additional cost for tokens written to the cache. There are no additional costs for cache writes with Amazon Nova models. For more information, see Amazon Bedrock pricing.
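
As a back-of-the-envelope check, you can estimate the input-side cost of a cached invocation from the usage counters, applying the 90 percent read discount described above. In this sketch, price_per_token and write_premium are placeholder parameters; look up the actual values for your model on the Amazon Bedrock pricing page (and set write_premium=0 for Amazon Nova models, since their cache writes cost nothing extra):

def estimate_input_cost(input_tokens, cache_read, cache_write,
                        price_per_token, write_premium=0.25):
    # Cache reads are billed at a 90 percent discount; cache writes may
    # carry a model-dependent premium (the 25 percent default here is a
    # placeholder, not a published Amazon Bedrock price).
    return (input_tokens * price_per_token
            + cache_read * price_per_token * 0.10
            + cache_write * price_per_token * (1 + write_premium))

# Token counts from the third invocation above: almost all input is cache reads.
print(estimate_input_cost(108, 29841, 0, price_per_token=3e-6))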

When using prompt caching, content is cached for up to 5 minutes, with each cache hit resetting this countdown. Prompt caching has been implemented to transparently support cross-Region inference, so your applications can get the cost optimization and latency benefits of prompt caching together with the flexibility of cross-Region inference.

These new capabilities make it easier to build cost-effective and high-performing generative AI applications. By intelligently routing requests and caching frequently used content, you can significantly reduce your costs while maintaining, or even improving, application performance.

To learn more and start using these new capabilities today, visit the Amazon Bedrock documentation and send feedback to AWS re:Post for Amazon Bedrock. You can find deep-dive technical content and discover how our Builder communities are using Amazon Bedrock at community.aws.

Danilo


