How to Make a Chatbot with Image Recognition

At Pickaxe, we've helped customers create and launch over 50,000 AI tools, so we're well aware that AI chatbots are becoming increasingly common across the internet, both for personal use and as commercial micro-SaaS products.

In some cases, builders find text-based chatbots limiting. They want multi-modality: the ability for end-users to upload files, links, YouTube videos, and yes, images too.

Fortunately, Pickaxe makes it very simple to turn a text-based chatbot into a vision chatbot by enabling image recognition. Giving your Pickaxe the ability to see and understand images takes just a few clicks and opens up a lot of new use cases.

In this blog post, we'll walk through what vision-enabled chatbots are, how you can enable image recognition in a chatbot, and even show you an example. You can also watch a step-by-step video tutorial that breaks down how to create an AI vision chatbot in Pickaxe.

Let's get started. First off, let's break down what image recognition is and how it works. This quick understanding will help inform how to effectively use the feature in your Pickaxe.

Image recognition is a capability provided by underlying AI vision models. These vision models have been trained on massive sets of images paired with descriptive captions, and as a result they have gained the ability to understand images. They can both understand the content of an image (i.e. the objects in it) and recognize and read text within an image, like a sign or website copy.

AI vision models can look at images and understand them.

For example, if you upload an image of a chef in front of a plate of spaghetti, the AI vision model will identify the spaghetti and the chef, as well as the setting, the background elements, the tablecloth, and even the chef's expression (whether she looks proud, disappointed, etc.).

After the vision model looks at the image, it then generates a detailed text description of everything it “sees” in the image. These descriptions include a surprising amount of detail! Here's a vision model looking at a still from a Gordon Ramsay show.


After the vision model produces a text description, the description is then given to the language model you're using in Pickaxe. Obviously some details may be lost in the translation from image to text, but the vision model captures a large amount of the image's essence.
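
To make that two-step flow concrete, here's a minimal sketch of it in Python using the OpenAI SDK. Pickaxe wires all of this up behind the scenes, so you never write this code yourself; the prompts, model choice, and image URL below are purely illustrative assumptions:

```python
# Minimal sketch of the two-step flow: a vision model describes the image,
# then a language model works from that text description. Everything here
# (prompts, image URL) is illustrative; Pickaxe handles this for you.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: ask a vision-capable model for a detailed description of the image.
description = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe everything you see in this image in detail."},
            {"type": "image_url", "image_url": {"url": "https://example.com/chef-spaghetti.jpg"}},
        ],
    }],
).choices[0].message.content

# Step 2: hand that text description to the language model along with your prompt.
critique = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a food critic. Critique the dish described below."},
        {"role": "user", "content": description},
    ],
).choices[0].message.content

print(critique)
```

The key point is simply that the image enters the conversation as text: whatever the vision model writes in step one is what the language model reasons about in step two.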

You can achieve a lot of cool things by leveraging this workflow. You can get the language model to write detailed critiques of paintings, write pick-up lines based on screenshots of dating profiles, and more. It can even write code to fix problems in website interfaces!

Next, let's look at an example of a simple vision-enabled chatbot and how it works. Then we'll dive into how to make a vision-enabled chatbot on Pickaxe.

Example: Landing Page Consultant

Here's a simple example of a chatbot with image recognition enabled that looks at website landing pages and then critiques them. We'll call it the Landing Page Reviewer. The functionality is fairly simple. Let's demonstrate by giving it a screenshot of the Craigslist interface. It will probably have more than a little bit of design feedback!


As you can see, the functionality of this vision-enabled Pickaxe chatbot is very simple. It can:

  • Allow users to upload a screenshot of a website landing page.
  • Look at the uploaded screenshot with a point-of-view (UX design, marketing, branding, etc).
  • Write a funny critique based on the uploaded screenshot.

Importantly, the Pickaxe does not just look at the image. On top of the vision model, there is a prompt that instructs the LLM to look at the image with a certain point-of-view and to offer actionable insights.

Now let’s break down how to enable image recognition in Pickaxe and make a vision-based chatbot.

Step 1. Write the prompt

As with every Pickaxe, the first step is to write a prompt in the builder. Writing a prompt for a vision chatbot is fairly simple. The only difference is explaining in the prompt that the chatbot will "see" an image.

Here's how we might write the prompt for a version of our website reviewer that critiques websites like a stand-up insult comic.

Prompt for vision-enabled chatbot on Pickaxe

As you can see, the prompt is fairly standard. Nothing crazy. But in the prompt we add explicit instructions that the chatbot will be working with an image, and we spell out what it should do with that image.
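
If the screenshot above is hard to read, here's an illustrative prompt along those lines. This is our own wording for the sake of this walkthrough, not the exact text from the screenshot:

```
You are a stand-up insult comic who moonlights as a UX consultant. The user
will upload a screenshot of a website landing page. Look carefully at the
layout, typography, copy, and imagery in the image. Roast the page
mercilessly, then end with three actionable suggestions for improving it.
```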

Step 2. Enable Image Recognition

The next step is to actually allow your chatbot to "see" images. To do this, you'll need to do two simple things:

  1. Enable users to upload images.
  2. Select GPT-4o or GPT-4o mini as your model.

This process is quick and painless on Pickaxe. To enable end-users to upload images, simply go to the Configure tab in the builder, and then click the check-box that allows users to upload their own files. Here's a screenshot below.

Then you'll want to go into the Prompt tab and make sure the model you're using is GPT-4o or GPT-4o mini. Currently, those are the only two models that work with image recognition on Pickaxe.

Make sure you use GPT-4o or GPT-4o mini for image recognition.

This will allow end-users to upload photos and images. Whenever a user uploads an image, it will be 'looked at', summarized in a descriptive caption, and then passed into your Pickaxe to be used in the context of your prompt.
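
For the curious, here's a hedged sketch of what that upload step typically looks like under the hood: the image file is base64-encoded into a data URL and attached to the model request. Again, Pickaxe does all of this for you; the file name and prompt here are hypothetical:

```python
# Sketch of how an uploaded image file is typically attached to a vision
# request: encoded as a base64 data URL. File name and prompt are hypothetical.
import base64

from openai import OpenAI

client = OpenAI()

with open("landing_page.png", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-mini",  # one of the two vision-capable models Pickaxe supports
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Critique this landing page screenshot."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded}"}},
        ],
    }],
)

print(response.choices[0].message.content)
```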

Step 3. Testing & Refining

Next, you'll want to test and refine your Pickaxe. Upload images in the right-hand panel using the paperclip icon.

Take a look at the results and see if you like them. If you don't, adjust the prompt in the left-hand panel under the Prompt tab.

Your focus while testing should be to get the results you want. If you feel stuck, you can always post in the Prompt Engineering help section of our community forum where users workshop prompts and talk about prompt design techniques. You can also check out prompt engineering websites like https://promptengineering.org/.

As you test more and more, you might be surprised to see what a good prompt + image recognition can achieve together!

Step 4. Launch your Chatbot

Finally, you’ll want to launch your chatbot so that people can actually use it!

When building an AI tool on Pickaxe, you have several options for launching it. You can deploy it as: 

  • An axe page (standalone web-page at a Pickaxe URL)
  • An embedded Pickaxe (a white-labeled embed on a third-party website)
  • A Studio (a standalone white-labeled web-app at a custom domain)

How you deploy your vision chatbot depends on how you want people to use it.

Making your own vision-enabled AI chatbot

And that’s it! That’s all there is to the process.

To go over it again, simply: 

  • Think of a concept
  • Write a prompt
  • Enable end-user upload (under the Configure tab)
  • Use GPT-4o or GPT-4o mini as your model

If you follow those steps, you can make an AI chatbot with image recognition in the Pickaxe chatbot builder in a matter of minutes!

If you have any questions, please check out this community forum discussion about making vision-enabled chatbots. There you can ask questions about the process, share your own examples, and workshop prompts around the topic.