Asynchronous LLM API Calls in Python: A Complete Guide

As developers and data scientists, we often need to interact with large language models (LLMs) through APIs. However, as our applications grow in complexity and scale, efficient and performant API interactions become critical. This is where asynchronous programming shines, allowing us to maximize throughput and minimize latency when working with LLM APIs.

In this comprehensive guide, we'll explore asynchronous LLM API calls in Python. We'll cover everything from the basics of asynchronous programming to advanced techniques for handling complex workflows. By the end of this article, you'll have a solid understanding of how to leverage asynchronous programming to supercharge your LLM-powered applications.

Before we dive into the specifics of async LLM API calls, let's establish a solid foundation in asynchronous programming concepts.

Asynchronous programming allows multiple operations to execute concurrently without blocking the main thread of execution. In Python, this is primarily achieved through the asyncio module, which provides a framework for writing concurrent code using coroutines, event loops, and futures.

Key concepts:

  • Coroutines: Functions defined with async def that can be paused and resumed.
  • Event Loop: The central execution mechanism that manages and runs asynchronous tasks.
  • Awaitables: Objects that can be used with the await keyword (coroutines, tasks, futures).

Here's a simple example to illustrate these concepts:

import asyncio

async def greet(name):
    await asyncio.sleep(1)  # Simulate an I/O operation
    print(f"Hello, {name}!")

async def main():
    # Run the three coroutines concurrently
    await asyncio.gather(
        greet("Alice"),
        greet("Bob"),
        greet("Charlie")
    )

asyncio.run(main())

In this example, we define an asynchronous function greet that simulates an I/O operation with asyncio.sleep(). The main function uses asyncio.gather() to run several greetings concurrently. Despite the one-second sleep in each call, all three greetings are printed after roughly one second in total, demonstrating the power of asynchronous execution.

The Need for Async in LLM API Calls

When working with LLM APIs, we often encounter scenarios where we need to make many API calls, either in sequence or in parallel. Traditional synchronous code can lead to significant performance bottlenecks, especially for high-latency operations like network requests to LLM services.

Consider a scenario where we need to generate summaries for 100 different articles using an LLM API. With a synchronous approach, each API call blocks until it receives a response, so completing all the requests can take several minutes. An asynchronous approach, on the other hand, lets us keep many API calls in flight at once, dramatically reducing the overall execution time, as the toy comparison below illustrates.
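To make the difference concrete, here is a minimal sketch that simulates each API call with asyncio.sleep() instead of a real network request; the one-second latency and the fake_summarize helper are assumptions for illustration only:

import asyncio
import time

SIMULATED_LATENCY = 1.0  # seconds per "API call" (stand-in for a real LLM request)

async def fake_summarize(article_id: int) -> str:
    await asyncio.sleep(SIMULATED_LATENCY)  # pretend this is the network round-trip
    return f"Summary of article {article_id}"

async def run_sequential(n: int) -> float:
    start = time.perf_counter()
    for i in range(n):
        await fake_summarize(i)  # each call waits for the previous one to finish
    return time.perf_counter() - start

async def run_concurrent(n: int) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(fake_summarize(i) for i in range(n)))  # all calls in flight at once
    return time.perf_counter() - start

async def main():
    n = 10
    seq = await run_sequential(n)
    conc = await run_concurrent(n)
    print(f"Sequential: {seq:.1f}s")  # roughly n * SIMULATED_LATENCY
    print(f"Concurrent: {conc:.1f}s")  # roughly 1 * SIMULATED_LATENCY

asyncio.run(main())

With ten simulated calls, the sequential version takes about ten seconds while the concurrent version finishes in about one; real LLM requests show the same pattern.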

Setting Up Your Environment

To get started with async LLM API calls, you'll need to set up your Python environment with the necessary libraries. Here's what you'll need:

  • Python 3.7 or higher (for native asyncio support)
  • aiohttp: an asynchronous HTTP client library
  • openai: the official OpenAI Python client (if you're using OpenAI's GPT models)
  • langchain: a framework for building applications with LLMs (optional, but recommended for complex workflows)

You can install these dependencies using pip:

pip install aiohttp openai langchain

Later examples in this guide also use tenacity (for retries), fastapi, and uvicorn (for serving), which can be installed the same way.

Basic Async LLM API Calls with asyncio and aiohttp

Let's start by making a simple asynchronous call to an LLM API. We'll use OpenAI's GPT-3.5 API via the official async client as an example, but the concepts apply to other LLM APIs as well.

import asyncio
from openai import AsyncOpenAI

async def generate_text(prompt, client):
    response = await client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    prompts = [
        "Explain quantum computing in simple terms.",
        "Write a haiku about artificial intelligence.",
        "Describe the process of photosynthesis."
    ]

    async with AsyncOpenAI() as client:
        tasks = [generate_text(prompt, client) for prompt in prompts]
        results = await asyncio.gather(*tasks)

    for prompt, result in zip(prompts, results):
        print(f"Prompt: {prompt}\nResponse: {result}\n")

asyncio.run(main())

In this example, we define an asynchronous function generate_text that calls the OpenAI API using the AsyncOpenAI client. The main function creates a task for each prompt and uses asyncio.gather() to run them concurrently.

This approach allows us to send multiple requests to the LLM API simultaneously, significantly reducing the total time required to process all prompts. If you prefer to talk to the HTTP endpoint directly with aiohttp rather than the official client, the same pattern applies, as sketched below.
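Here is a minimal sketch of that direct-aiohttp variant. The endpoint URL and payload shape follow OpenAI's chat completions API; reading the key from the OPENAI_API_KEY environment variable is an assumption made for this sketch:

import asyncio
import os
import aiohttp

API_URL = "https://api.openai.com/v1/chat/completions"  # OpenAI chat completions endpoint

async def generate_text_raw(prompt, session):
    payload = {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}  # assumed env var
    async with session.post(API_URL, json=payload, headers=headers) as resp:
        resp.raise_for_status()
        data = await resp.json()
        return data["choices"][0]["message"]["content"]

async def main():
    prompts = ["Explain quantum computing in simple terms."]
    # One ClientSession for all requests gives us connection pooling for free
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(generate_text_raw(p, session) for p in prompts))
    for prompt, result in zip(prompts, results):
        print(f"Prompt: {prompt}\nResponse: {result}\n")

asyncio.run(main())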

Advanced Techniques: Batching and Concurrency Control

While the previous example demonstrates the basics of async LLM API calls, real-world applications often require more sophisticated approaches. Let's explore two important techniques: batching requests and controlling concurrency.

Batching Requests: When dealing with a large number of prompts, it's often more efficient to batch them into groups rather than firing off an individual request for every prompt at once. This keeps the number of in-flight API calls manageable and can lead to better performance.

import asyncio
from openai import AsyncOpenAI

async def process_batch(batch, client):
    # Send one request per prompt in the batch, concurrently
    responses = await asyncio.gather(*[
        client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        ) for prompt in batch
    ])
    return [response.choices[0].message.content for response in responses]

async def main():
    prompts = [f"Tell me a fact about number {i}" for i in range(100)]
    batch_size = 10

    async with AsyncOpenAI() as client:
        results = []
        for i in range(0, len(prompts), batch_size):
            batch = prompts[i:i+batch_size]
            batch_results = await process_batch(batch, client)
            results.extend(batch_results)

    for prompt, result in zip(prompts, results):
        print(f"Prompt: {prompt}\nResponse: {result}\n")

asyncio.run(main())

Concurrency Control: While asynchronous programming allows for concurrent execution, it's important to control the level of concurrency to avoid overwhelming the API server or exceeding rate limits. We can use asyncio.Semaphore for this purpose.

import asyncio
from openai import AsyncOpenAI

async def generate_text(prompt, client, semaphore):
    async with semaphore:  # waits here if too many requests are already in flight
        response = await client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

async def main():
    prompts = [f"Tell me a fact about number {i}" for i in range(100)]
    max_concurrent_requests = 5
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async with AsyncOpenAI() as client:
        tasks = [generate_text(prompt, client, semaphore) for prompt in prompts]
        results = await asyncio.gather(*tasks)

    for prompt, result in zip(prompts, results):
        print(f"Prompt: {prompt}\nResponse: {result}\n")

asyncio.run(main())

In this example, we use a semaphore to limit the number of concurrent requests to 5, ensuring we don't overwhelm the API server.

Error Handling and Retries in Async LLM Calls

When working with external APIs, it's crucial to implement robust error handling and retry mechanisms. Let's enhance our code to handle common errors and implement exponential backoff for retries.

import asyncio
from openai import AsyncOpenAI
from tenacity import retry, stop_after_attempt, wait_exponential

class APIError(Exception):
    pass

# reraise=True surfaces the original APIError once retries are exhausted,
# instead of tenacity's RetryError wrapper
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10), reraise=True)
async def generate_text_with_retry(prompt, client):
    try:
        response = await client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error occurred: {e}")
        raise APIError("Failed to generate text")

async def process_prompt(prompt, client, semaphore):
    async with semaphore:
        try:
            result = await generate_text_with_retry(prompt, client)
            return prompt, result
        except APIError:
            return prompt, "Failed to generate response after multiple attempts."

async def main():
    prompts = [f"Tell me a fact about number {i}" for i in range(20)]
    max_concurrent_requests = 5
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async with AsyncOpenAI() as client:
        tasks = [process_prompt(prompt, client, semaphore) for prompt in prompts]
        results = await asyncio.gather(*tasks)

    for prompt, result in results:
        print(f"Prompt: {prompt}\nResponse: {result}\n")

asyncio.run(main())

This enhanced version includes:

  • A custom APIError exception for API-related errors.
  • A generate_text_with_retry function decorated with @retry from the tenacity library, implementing exponential backoff.
  • Error handling in the process_prompt function to catch and report failures.

Optimizing Performance: Streaming Responses

For long-form content generation, streaming responses can significantly improve the perceived performance of your application. Instead of waiting for the entire response, you can process and display chunks of text as they become available.

import asyncio
from openai import AsyncOpenAI

async def stream_text(prompt, client):
    stream = await client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )

    full_response = ""
    async for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            content = chunk.choices[0].delta.content
            full_response += content
            print(content, end='', flush=True)  # display each piece as it arrives

    print("\n")
    return full_response

async def main():
    prompt = "Write a short story about a time-traveling scientist."

    async with AsyncOpenAI() as client:
        result = await stream_text(prompt, client)

    print(f"Full response:\n{result}")

asyncio.run(main())

This example demonstrates how to stream the response from the API, printing each chunk as it arrives. This approach is particularly useful for chat applications or any scenario where you want to give the user real-time feedback; the sketch below shows one way to hand those chunks off to a separate consumer as they arrive.
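As a variation on the example above, this minimal sketch pushes each chunk onto an asyncio.Queue so that a separate consumer coroutine, standing in here for a UI or websocket handler, can render partial output while generation is still running. The None end-of-stream sentinel is simply a convention chosen for this sketch:

import asyncio
from openai import AsyncOpenAI

async def produce_chunks(prompt, client, queue):
    stream = await client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    async for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            await queue.put(content)  # hand each piece of text to the consumer
    await queue.put(None)             # sentinel: no more chunks

async def consume_chunks(queue):
    while True:
        content = await queue.get()
        if content is None:
            break
        print(content, end='', flush=True)  # a real app might forward this to a websocket

async def main():
    queue = asyncio.Queue()
    async with AsyncOpenAI() as client:
        await asyncio.gather(
            produce_chunks("Write a haiku about the ocean.", client, queue),
            consume_chunks(queue),
        )

asyncio.run(main())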

Building Async Workflows with LangChain

For more complex LLM-powered applications, the LangChain framework provides high-level abstractions that simplify chaining multiple LLM calls and integrating other tools. Let's look at an example of using LangChain with async capabilities:

The example below shows how LangChain can be used to build such workflows with streaming and asynchronous execution. The AsyncCallbackManager and StreamingStdOutCallbackHandler enable real-time streaming of the generated content.

import asyncio
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import AsyncCallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

async def generate_story(topic):
    llm = OpenAI(
        temperature=0.7,
        streaming=True,
        callback_manager=AsyncCallbackManager([StreamingStdOutCallbackHandler()])
    )
    prompt = PromptTemplate(
        input_variables=["topic"],
        template="Write a short story about {topic}."
    )
    chain = LLMChain(llm=llm, prompt=prompt)
    return await chain.arun(topic=topic)

async def main():
    topics = ["a magical forest", "a futuristic city", "an underwater civilization"]
    tasks = [generate_story(topic) for topic in topics]
    stories = await asyncio.gather(*tasks)

    for topic, story in zip(topics, stories):
        print(f"\nTopic: {topic}\nStory: {story}\n{'='*50}\n")

asyncio.run(main())

Serving Async LLM Applications with FastAPI

To make your async LLM application available as a web service, FastAPI is a great choice thanks to its native support for asynchronous operations. Here's an example of a simple API endpoint for text generation:

import asyncio
from fastapi import FastAPI, BackgroundTasks
from pydantic import BaseModel
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()

class GenerationRequest(BaseModel):
    prompt: str

class GenerationResponse(BaseModel):
    generated_text: str

@app.post("/generate", response_model=GenerationResponse)
async def generate_text(request: GenerationRequest, background_tasks: BackgroundTasks):
    response = await client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": request.prompt}]
    )
    generated_text = response.choices[0].message.content

    # Simulate some post-processing in the background
    background_tasks.add_task(log_generation, request.prompt, generated_text)

    return GenerationResponse(generated_text=generated_text)

async def log_generation(prompt: str, generated_text: str):
    # Simulate logging or additional processing
    await asyncio.sleep(2)
    print(f"Logged: Prompt '{prompt}' generated text of length {len(generated_text)}")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

This FastAPI application exposes a /generate endpoint that accepts a prompt and returns generated text. It also demonstrates how to use background tasks for additional processing without blocking the response. Once the server is running, you can exercise the endpoint as shown below.
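Assuming the server is running locally on port 8000, one quick way to test the endpoint is with an async HTTP client such as httpx; the choice of httpx and the example prompt are assumptions for this sketch, and curl or any other client works just as well:

import asyncio
import httpx

async def main():
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(
            "http://localhost:8000/generate",
            json={"prompt": "Explain event loops in one paragraph."},
        )
        resp.raise_for_status()
        print(resp.json()["generated_text"])

asyncio.run(main())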

Best Practices and Common Pitfalls

As you work with async LLM APIs, keep these best practices in mind:

  1. Use connection pooling: When making multiple requests, reuse clients and connections to reduce overhead (see the sketch after this list).
  2. Implement proper error handling: Always account for network issues, API errors, and unexpected responses.
  3. Respect rate limits: Use semaphores or other concurrency-control mechanisms to avoid overwhelming the API.
  4. Monitor and log: Implement comprehensive logging to track performance and identify issues.
  5. Use streaming for long-form content: It improves user experience and allows for early processing of partial results.
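On the first point, connection pooling mostly means creating one client or session per application and passing it around, rather than constructing a new one for every request; both aiohttp.ClientSession and the AsyncOpenAI client keep a pool of connections internally. A minimal sketch with aiohttp (the example URL and pool size are arbitrary):

import asyncio
import aiohttp

async def fetch_status(session, url):
    # Every call through the same session reuses pooled TCP/TLS connections
    async with session.get(url) as resp:
        return resp.status

async def main():
    urls = ["https://example.com"] * 5
    # One session for the whole application, not one per request
    async with aiohttp.ClientSession(
        connector=aiohttp.TCPConnector(limit=10)  # cap the pool size as well
    ) as session:
        statuses = await asyncio.gather(*(fetch_status(session, u) for u in urls))
    print(statuses)

asyncio.run(main())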

