An Introduction to Cloudflare Workers AI
Published on: Monday, December 2, 2024
Introduction
Imagine adding Generative AI capabilities to your web applications without worrying about infrastructure or scaling complexities. With Cloudflare Workers AI, you can do just that—directly at the edge! In this blog, we’ll explore how you can seamlessly integrate Generative AI into your projects in less than 25 lines of code. Whether you're looking to create dynamic content, enhance user interactions, or build smarter workflows, Cloudflare Workers AI simplifies it all with unmatched speed and efficiency. Let’s dive in and see how you can harness this cutting-edge technology to revolutionize your applications.
Overview
In this blog we are going to look into:
- Part 1 - Building a Streaming Server API
- Part 2 - Adding AI Gateway
- Part 3 - Integrating it on the client side
Let's dive right in! 🚀
Part 1 - Building a Streaming Server API
When dealing with LLMs, the preferred way to handle text-generation responses is streaming. LLMs generate responses sequentially through repeated inference: the full output for a single prediction task is often a sequence of hundreds, if not thousands, of tokens. While the first tokens in the sequence arrive within a few seconds, generating the full output can take considerably longer.
With streaming responses, you can start displaying output to your users within moments rather than waiting for the entire response to be generated. In this section, we will look at how to implement this.
Let's first create a simple streaming server API. I am doing this within a Nuxt application with TypeScript support:
export default defineEventHandler(async (event) => {
  const ai = hubAI(); // access AI bindings
  const responseStream = await ai.run('@cf/google/gemma-2b-it-lora', {
    stream: true, // ask the model to stream tokens as they are generated
    messages: [
      { role: "system", content: "You are Alan Watts" },
      {
        role: "user",
        content: "Write 1 inspiring quote that is less than 50 words about hard honest work as if you were Alan Watts",
      },
    ],
  });
  // forward the model's stream directly to the client
  return sendStream(event, responseStream as ReadableStream);
})
With this, you are ready to consume the response as a stream from the client side. On to the next step.
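Note that hubAI() comes from NuxtHub, so the Workers AI binding needs to be enabled in the project. Here is a minimal sketch of what that could look like in nuxt.config.ts, assuming the @nuxthub/core module is used (double-check the NuxtHub docs for your exact setup):
export default defineNuxtConfig({
  modules: ['@nuxthub/core'],
  hub: {
    ai: true, // enable the Workers AI binding used by hubAI()
  },
})
Also, assuming the handler above is saved as server/api/ai-test.post.ts (a hypothetical but conventional location), Nuxt's file-based routing exposes it at POST /api/ai-test, which is the endpoint the client component will call later.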
Part 2 - Adding AI Gateway
Using Cloudflare AI Gateway, we can enable features such as:
- Logging: get insights on requests and errors.
- Analytics: view metrics such as number of requests, tokens, and costs.
- Caching responses: respond from Cloudflare's cache instead of the model provider, minimizing costs.
- Rate limiting: control scaling by limiting the number of requests your application receives.
- Request fallbacks: improve resiliency by defining request retries and model fallbacks.
- Custom providers: Workers AI, OpenAI, Azure OpenAI, Hugging Face, Replicate, and more work seamlessly with AI Gateway.
To do this we need to:
- Create an AI Gateway from the Cloudflare dashboard (AI > AI Gateway) and note its gateway id.
- Pass the gateway configuration as an additional options argument to the ai.run() call in our server API.
Here is the updated server API with the gateway configuration:
export default defineEventHandler(async (event) => {
  const ai = hubAI(); // access AI bindings
  const responseStream = await ai.run(
    '@cf/google/gemma-2b-it-lora',
    {
      stream: true,
      messages: [
        { role: "system", content: "You are Alan Watts" },
        {
          role: "user",
          content: "Write 1 inspiring quote that is less than 50 words about hard honest work as if you were Alan Watts",
        },
      ],
    },
    {
      gateway: {
        id: 'persona-portfolio-work', // the AI Gateway created in the dashboard
        skipCache: false,             // serve cached responses when available
        cacheTtl: 3600,               // cache responses for one hour
      },
    },
  );
  return sendStream(event, responseStream as ReadableStream);
})
And voilà! You now have a deployable streaming API that leverages Workers AI and AI Gateway. On to our next step.
Part 3 - Integrating it on the client side
At this point we are ready to integrate our streaming responses on the client side. The following Nuxt component shows how I integrate it on the client side (with hybrid rendering/hydration). You should be able to consume it from any JS-based framework in a similar manner.
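One note on the response shape before the component: Workers AI streams the response as server-sent-event style chunks, which is why the code below scans for lines starting with data: and stops at [DONE]. Roughly (the exact payload depends on the model), the chunks look something like:
data: {"response":"Work"}
data: {"response":" is"}
data: [DONE]
Each data: line carries a JSON object whose response field holds the newly generated tokens, so the client simply appends those to the quote as they arrive.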
<script setup lang="ts">
import { ref, onMounted } from 'vue';
import { animate, stagger } from 'motion';

const quote = ref('');
const error = ref<string | null>(null);
const isLoading = ref(true);

async function getQuote() {
  try {
    const url = "/api/ai-test";
    const response = await $fetch<ReadableStream>(url, {
      method: 'POST',
      body: {},
      responseType: 'stream',
    });
    // Decode the byte stream into text and read it chunk by chunk
    const reader = (response as ReadableStream).pipeThrough(new TextDecoderStream()).getReader();
    let buffer = '';
    while (true) {
      const { done, value } = await reader.read();
      if (done) {
        if (buffer.trim()) {
          console.warn('Stream ended with unparsed data:', buffer);
        }
        break;
      }
      buffer += value;
      // A chunk may contain several (or partial) SSE lines; keep the remainder in the buffer
      const lines = buffer.split('\n');
      buffer = lines.pop() || '';
      for (const line of lines) {
        isLoading.value = false;
        if (line.startsWith('data: ')) {
          const data = line.slice('data: '.length).trim();
          if (data === '[DONE]') break;
          try {
            const jsonData = JSON.parse(data);
            if (jsonData.response) {
              quote.value += jsonData.response; // append new tokens as they arrive
            }
          } catch (parseError) {
            console.warn('Error parsing JSON:', parseError);
          }
        }
      }
    }
  } catch (err) {
    console.error(`Error: ${err}`);
    error.value = `An error occurred.`;
  } finally {
    isLoading.value = false;
  }
}

onMounted(() => {
  // Looping animation for the loading indicator
  animate(
    '.loading-container',
    { opacity: [0, 1], y: ["40%", "0%"] },
    { duration: 1, delay: stagger(0.4), repeat: Infinity }
  );
  getQuote();
});
</script>
<template>
  <div class="quote-container">
    <!-- Motion for Loading -->
    <div v-if="isLoading" class="loading-container">
      <div>Generating Quote...</div>
    </div>
    <div v-else-if="error">
      <p>Error in generating quote: {{ error }}</p>
    </div>
    <!-- Motion for Quote -->
    <div v-else class="quote-card-container">
      <p class="quote">{{ quote }}</p><br/>
      <p class="author">— <a href="https://developers.cloudflare.com/workers-ai/models/gemma-2b-it-lora/" target="_blank" class="text-blue-300">@cf/google/gemma-2b-it-lora</a> as Alan Watts</p>
    </div>
  </div>
</template>
<style scoped>
.quote-container {
  display: flex;
  justify-content: center;
  align-items: center;
  height: auto;
  padding-top: 3rem;
}
.loading-container,
.quote-card-container {
  display: flex;
  flex-direction: column;
  align-items: end;
  height: auto;
}
.quote {
  font-size: 1.25rem;
  font-weight: 600;
  color: #7c7c7c;
}
.author {
  font-size: 1rem;
  font-style: italic;
  color: #b77401;
}
</style>
Load the component on a page and you should see something like this:

As you can see from the above, we now have fully functional client-side rendering of streaming responses from our Workers AI implementation. Try refreshing the page (post-deployment) and you should be able to observe the caching and rate-limiting behaviors.
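For completeness, here is a minimal sketch of dropping the component into a page, assuming it was saved as components/AiQuote.vue (the file name is hypothetical; Nuxt auto-imports components from that directory):
<!-- pages/index.vue -->
<template>
  <main>
    <AiQuote />
  </main>
</template>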
Conclusion
Pardon the crude implementation. This is a bare-bones implementation of streaming API responses from Workers AI. Please refer to each model's documentation for further input parameters that let you control text generation with greater granularity. I hope this is helpful and of use to someone.