Tech

[Advanced] Build Agents with Vision Abilities Using OpenAI & AutoGen & Llava & Stable Diffusion

Explore AI vision agents with AutoGen and Llava, crafting images from text and iterative improvement, revolutionizing content creation and design.

Henry

Oct 19, 2023 • 8 min read

In the rapidly evolving field of artificial intelligence, the development of agents with vision capabilities is an exciting frontier. In this blog post, we'll explore a practical example of how you can create such an agent today. Our agent will be tasked with generating images based on text prompts and analyzing these images for improvement using two key components: AutoGen and Llava.

In this article, we will let Agent A to generate an image based on user input. Then let Agent B to review/rate/comment the result, sending back the result to Agent A. Agent A will re-generate the image until it meets the requirement, Lastly user can add a final touch to the result.

What are AutoGen and Llava?

AutoGen is a framework that enables the creation of multi-agent systems. In our case, we need to build a group of two different agents. One agent will specialize in image generation, while the other will focus on image analysis and providing feedback on how to improve the prompts.

Llava, on the other hand, is a multimodal model based on Llama 2. While it may not match the capabilities of GPT-4, it serves as a good alternative for this proof-of-concept project.

Here's a step-by-step guide on how to build an agent with vision capabilities using AutoGen and Llava:

Use Replicate

The optimal way is to use local LLM and Stable Diffusion to provide text and image generation.

However, for demo purposes, I will use Replicate to generate and analyze images. Anyway the approach is the same.

You will have to export env var in your VS Code terminal

export REPLICATE_API_TOKEN=your_replicate_api_token

Import necessary libraries and modules

from autogen import AssistantAgent, UserProxyAgent, config_list_from_json
import autogen
import replicate
import requests
from datetime import datetime
import http.client
import json
import base64

Load the configuration list from a JSON file or environment variable

config_list = config_list_from_json(env_or_file="OPENAI_CONFIG_LIST")

Create a configuration dictionary for the Language Model (LLM)

llm_config = {
    "config_list": config_list,
    "request_timeout": 120
}

Define a function to review an image

def img_review(image_path, prompt):
    # Use the 'replicate' library to run an AI model for image review
    output = replicate.run(
        "yorickvp/llava-13b:2facb4a474a0462c15041b78b1ad70952ea46b5ec6ad29583c0b29dbd4249591",
        input={
            "image": open(image_path, "rb"),
            "prompt": f"Please provide a description of the image and then rate, on a scale of 1 to 10, how closely the image aligns with the provided description. {prompt}?",
        }
    )

    # Concatenate the output into a single string and return it
    result = ""
    for item in output:
        result += item
    return result

Define a function to generate an image from text

def text_to_image_generation(prompt):
    # Use the 'replicate' library to run an AI model for text-to-image generation
    output = replicate.run(
        "stability-ai/sdxl:c221b2b8ef527988fb59bf24a8b97c4561f1c671f73bd389f866bfb27c061316",
        input={
            "prompt": prompt
        }
    )

    if output and len(output) > 0:
        # Get the image URL from the output
        image_url = output[0]
        print(f"Generated image for '{prompt}': {image_url}")

        # Download the image and save it with a filename based on the prompt and current time
        current_time = datetime.now().strftime("%Y%m%d%H%M%S")
        shortened_prompt = prompt[:50]
        filename = f"image/{shortened_prompt}_{current_time}.png"

        # Download the image using 'requests' library and save it to a file
        response = requests.get(image_url)
        if response.status_code == 200:
            with open(filename, "wb") as file:
                file.write(response.content)
            return f"Image saved as '{filename}'"
        else:
            return "The image could not be successfully downloaded and saved."
    else:
        return "The image generation process was unsuccessful."

Define a configuration for the assistant agents

llm_config_assistants = {
    "functions": [
        {
            "name": "text_to_image_generation",
            "description": "Utilize the most recent AI model to create an image using a given prompt and provide the file path to the generated image.",
            "parameters": {
                "type": "object",
                "properties": {
                    "prompt": {
                        "type": "string",
                        "description": "A detailed textual prompt that provides a description of the image to be generated.",
                    }
                },
                "required": ["prompt"],
            },
        },
        {
            "name": "image_review",
            "description": "Examine and assess the image created by AI according to the initial prompt, offering feedback and recommendations for enhancement.",
            "parameters": {
                "type": "object",
                "properties": {
                    "prompt": {
                        "type": "string",
                        "description": "The original input text that served as the prompt for generating the image.",
                    },
                    "image_path": {
                        "type": "string",
                        "description": "The complete file path for the image, including both the directory path and the file extension.",
                    }
                },
                "required": ["prompt", "image_path"],
            },
        },
    ],
    "config_list": config_list,
    "request_timeout": 120
}

Create assistant agents with specific roles

img_gen_assistant = AssistantAgent(
    name="text_to_img_prompt_expert",
    system_message="As an expert in text-to-image AI models, you will utilize the 'text_to_image_generation' function to create an image based on the given prompt and iterate on the prompt, incorporating feedback until it achieves a perfect rating of 10/10.",
    llm_config=llm_config_assistants,
    function_map={
        "image_review": img_review,
        "text_to_image_generation": text_to_image_generation
    }
)

img_critic_assistant = AssistantAgent(
    name="img_critic",
    system_message="In the role of an AI image critic, your task is to employ the 'image_review' function to evaluate the image generated by the 'text_to_img_prompt_expert' using the original prompt. You will then offer feedback on how to enhance the prompt for better image generation.",
    llm_config=llm_config_assistants,
    function_map={
        "image_review": img_review,
        "text_to_image_generation": text_to_image_generation
    }
)

Create a user proxy agent

user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="ALWAYS",
)

Create a group chat with the defined agents

groupchat = autogen.GroupChat(
    agents=[user_proxy, img_gen_assistant, img_critic_assistant],
    messages=[],  # The initial messages in the chat
    max_round=10  # Maximum rounds of conversation
)

Create a group chat manager

manager = autogen.GroupChatManager(
    groupchat=groupchat,
    llm_config=llm_config
)

Start the conversation by having the user proxy initiate the chat

user_proxy.initiate_chat(
    manager, message="Generate a photorealistic image of a corgi riding a skateboard.")

Conclusion

Building agents with vision capabilities is an exciting venture, and with the right tools and frameworks, it's increasingly accessible. AutoGen, Llava, and the power of the OpenAI API allow you to create agents that generate images based on text prompts, analyze those images, and iteratively improve the results. This technology has the potential to revolutionize various industries, from content creation to design and beyond.

For a more detailed guide on using AutoGen and Llava, you can dig more and get in-depth insights into these models. If you're exploring large-scale deployment, consider hosting your own AI models on platforms like RunPod. While initial costs might be higher, it can be a cost-effective solution for high-volume projects.

This article is just a glimpse of the possibilities that the fusion of text and image models offers. The future of AI is filled with exciting opportunities, and the development of agents with vision capabilities is a promising step forward.

Demo Terminal Output

Note: All processes are automatic, except for the final touch by user's input

user_proxy (to chat_manager):

Generate a photo realistic image of a corgi riding skateboard.

--------------------------------------------------------------------------------
text_to_img_prompt_expert (to chat_manager):

***** Suggested function Call: text_to_image_generation *****
Arguments: 
{
  "prompt": "A photo realistic image of a corgi riding skateboard."
}
*************************************************************

--------------------------------------------------------------------------------

>>>>>>>> EXECUTING FUNCTION text_to_image_generation...
generated image for A photo realistic image of a corgi riding skateboard.: https://pbxt.replicate.delivery/fgw1btPdiUQHWCYdDbc3S2Zl8He7ZIOSAJi9x9g9fXPx6feNC/out-0.png
text_to_img_prompt_expert (to chat_manager):

***** Response from calling function "text_to_image_generation" *****
Image saved as 'image/A photo realistic image of a corgi riding skateboa_20231019233802.png'
*********************************************************************

--------------------------------------------------------------------------------

img_critic (to chat_manager):

***** Suggested function Call: image_review *****
Arguments: 
{
  "prompt": "A photo realistic image of a corgi riding skateboard.",
  "image_path": "image/A photo realistic image of a corgi riding skateboa_20231019233802.png"
}
*************************************************

--------------------------------------------------------------------------------

>>>>>>>> EXECUTING FUNCTION image_review...
text_to_img_prompt_expert (to chat_manager):

***** Response from calling function "image_review" *****
The image shows a corgi riding a skateboard down a street. The dog is positioned in the center of the skateboard, and it appears to be enjoying the ride. The scene is quite similar to the text prompt, with the corgi being the main focus and the skateboard being the primary mode of transportation. On a scale of 1 to 10, the image is a 9 in terms of similarity to the text prompt.
*********************************************************

--------------------------------------------------------------------------------
text_to_img_prompt_expert (to chat_manager):

Based on the feedback, the image is already quite close to the text prompt. However, to improve it further, we can make a few adjustments:

1. Enhance the realism: We can add more texture and details to the fur of the corgi and the skateboard to make them look more realistic.

2. Adjust the composition: We can experiment with different angles and perspectives to make the image more dynamic and engaging.

Let me generate an improved image based on these suggestions.
***** Suggested function Call: text_to_image_generation *****
Arguments: 
{
  "prompt": "A photo realistic image of a corgi riding a skateboard with enhanced fur texture and dynamic composition."
}
*************************************************************

--------------------------------------------------------------------------------

>>>>>>>> EXECUTING FUNCTION text_to_image_generation...
generated image for A photo realistic image of a corgi riding a skateboard with enhanced fur texture and dynamic composition.: https://pbxt.replicate.delivery/nFph0t4tI06KGlOqj754lK5oSbpcAUUyRUuj1oq99Gyjf33IA/out-0.png
text_to_img_prompt_expert (to chat_manager):

***** Response from calling function "text_to_image_generation" *****
Image saved as 'image/A photo realistic image of a corgi riding a skateb_20231019233857.png'
*********************************************************************

--------------------------------------------------------------------------------

Provide feedback to chat_manager. Press enter to skip and use auto-reply, or type 'exit' to end the conversation: Let the background be a trail in a park
user_proxy (to chat_manager):

Let the background be a trail in a park

--------------------------------------------------------------------------------
text_to_img_prompt_expert (to chat_manager):

***** Suggested function Call: text_to_image_generation *****
Arguments: 
{
  "prompt": "A photo realistic image of a corgi riding a skateboard on a trail in a park with enhanced fur texture and dynamic composition."
}
*************************************************************

--------------------------------------------------------------------------------

>>>>>>>> EXECUTING FUNCTION text_to_image_generation...
generated image for A photo realistic image of a corgi riding a skateboard on a trail in a park with enhanced fur texture and dynamic composition.: https://pbxt.replicate.delivery/zfLfVwaTyoglP0OreHef3wQgeUDdn7cc2mCkHfztk6tbbA43IA/out-0.png
img_critic (to chat_manager):

***** Response from calling function "text_to_image_generation" *****
Image saved as 'image/A photo realistic image of a corgi riding a skateb_20231019234144.png'
*********************************************************************

--------------------------------------------------------------------------------

Copyright statement: Unless otherwise stated, all articles on this blog adopt the CC BY-NC-SA 4.0 license agreement. For non-commercial reprints and citations, please indicate the author: Henry, and original article URL. For commercial reprints, please contact the author for authorization.