The Benefits of Adding Multimodal Search to a Conversational Shopping Concierge

August 12, 2024
Welcome back to the second in a series of articles we’re producing on the effects AI will have on the e-commerce shopping experience. The first article in the series discussed Conversational Shopping. Today we’ll focus on Multimodal Generative AI Search.
Multimodal generative AI search refers to a search capability that leverages generative AI to process and respond to queries using multiple types of input modalities, such as text, images, voice, or even video. Unlike traditional search engines that typically rely on keywords or text-based queries, multimodal generative AI can understand and generate relevant responses across different forms of media.
For example, with multimodal generative AI search, a user might upload a picture of an item, describe it in text, choose a color to match, or ask a question verbally, and the AI could generate search results that match the image, text, or voice query. This capability is particularly useful in contexts like e-commerce, where users might want to search for products by describing them, uploading a photo, or combining both methods to find exactly what they’re looking for. So what’s the big deal?
Speed!
For e-commerce, multimodal search offers numerous material benefits over traditional search. The first is accelerated discovery. A holy grail for e-commerce is near-real-time search: the customer thinks of a product and the closest matches in the catalog appear as if by magic. OK, AI cannot read minds yet, but stay tuned for that… However, it can process product search requests fast and, if trained correctly, with extraordinary accuracy.
For example, a traditional search for a short, electric green party dress might look like this: click on Women’s, click on Dresses, click on Short or Mini, then click the green(?) swatch in the sidebar color palette, then scroll for a while trying to find a match. Call it a minute, best case.
But what if the real search looked like the following: “I’m looking for a sexy mini dress made of a lightweight cotton-poly blend that is perfect for a night out dancing on a warm summer night in Seattle. Oh yeah, and it should have a plunging neckline and short sleeves. And be under $50, please!” Try typing that into a traditional e-commerce text box and see what pops up! Here’s how it looks with multimodal GenAI:
With a multimodal shopping assistant, this could produce the best matches in the catalog in 5-7 seconds (yes, depending on the shopper’s internet connection). Using complex math (a calculator?), that works out to roughly ten times faster than traditional search, a full order of magnitude!
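Under the hood, a search like this typically embeds the shopper’s sentence into a vector and ranks the catalog by similarity. Here is a minimal sketch of that retrieval step. Everything in it is illustrative: the catalog, the embedding vectors, and the query vector are hand-made toys standing in for the output of a real embedding model.

```python
# Sketch only: assumes product text embeddings were precomputed offline by an
# embedding model; tiny hand-made vectors are used here to show the mechanics.
import numpy as np

# Hypothetical catalog: each product has a precomputed embedding vector.
catalog = {
    "green mini dress":     np.array([0.9, 0.1, 0.0]),
    "blue maxi gown":       np.array([0.1, 0.9, 0.2]),
    "green cocktail dress": np.array([0.8, 0.2, 0.1]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_vec, k=2):
    """Rank products by cosine similarity to the query embedding."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in catalog.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

# In production the query vector would come from embedding the shopper's whole
# sentence; here it is hand-made to sit close to the green-dress products.
query = np.array([0.85, 0.15, 0.05])
for name, score in search(query):
    print(f"{name}: {score:.3f}")
```

In a real deployment the linear scan would be replaced by an approximate nearest-neighbor index, which is what keeps response times in that 5-7 second envelope even over large catalogs.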
1 + 1 = Better Still
But wait, you ask… what happened to the electric green component? That’s where multimodal search starts to work its magic. Just add a 64M-color picker to the search process, pick the closest match to your idea of “electric green,” and ask the same question. Drum roll, please… the nearest possible matches to this “combo” search pop up. Same 5-7 seconds.
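One simple way to blend the color picker into the ranking is to score each product on both its text match and its color distance from the picked swatch, then combine the two. The sketch below assumes each product carries a dominant RGB swatch (all names, scores, and the 0.4 weighting are illustrative, not a real system’s values).

```python
# Sketch: combine a text-match score with a color-picker match, assuming each
# product carries an (R, G, B) swatch extracted from its catalog imagery.
import math

products = [
    # (name, text score from the conversational query, dominant RGB color)
    ("neon mini dress",  0.92, (57, 255, 20)),   # close to "electric green"
    ("sage wrap dress",  0.88, (134, 169, 119)),
    ("black slip dress", 0.90, (20, 20, 20)),
]

PICKED = (57, 255, 20)  # the shopper's "electric green" from the color picker
MAX_DIST = math.dist((0, 0, 0), (255, 255, 255))  # farthest two RGB colors

def color_score(rgb):
    """1.0 at an exact match, falling off with Euclidean RGB distance."""
    return 1.0 - math.dist(rgb, PICKED) / MAX_DIST

def combined(name, text_score, rgb, w_color=0.4):
    # Weighted blend: the color signal re-ranks the text-only results.
    return (1 - w_color) * text_score + w_color * color_score(rgb)

ranked = sorted(products, key=lambda p: combined(*p), reverse=True)
print(ranked[0][0])  # the electric-green dress now outranks the rest
```

Plain RGB distance is a crude proxy for perceived color difference; a production system would more likely compare colors in a perceptual space such as CIELAB, but the re-ranking pattern is the same.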
Let’s Throw Image Search Into The Blender
What if the shopper is out that same night, sees a dress a friend is wearing, and wants to find something similar online? With image-similarity AI added to the mix, she could use the photo component of the shopping assistant to snap a picture of the dress and, with an even shorter wait (1-2 seconds), bring up the nearest matches in the catalog (not exact, of course, but the closest available). She can tweak the color with the color selector and use the conversational component to adjust the price requirements. Wow! (Right?)
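The image path works the same way as the text path, just over image embeddings, and the conversational price tweak becomes a filter on the candidates. Here is a toy sketch; the embeddings are hand-made stand-ins for what a vision encoder (e.g. a CLIP-style model) would produce from product photos and the snapped picture.

```python
# Sketch: image-similarity lookup over precomputed image embeddings, with a
# conversational price cap applied as a filter. All vectors are toy values.
import numpy as np

catalog = [
    # (name, price, image-embedding vector)
    ("satin slip dress",  79.0, np.array([0.2, 0.9, 0.3])),
    ("floral mini dress", 45.0, np.array([0.8, 0.3, 0.5])),
    ("ribbed knit dress", 39.0, np.array([0.7, 0.4, 0.6])),
]

def nearest(photo_vec, max_price=None, k=2):
    """Rank by cosine similarity to the snapped photo, honoring a price cap."""
    items = [p for p in catalog if max_price is None or p[1] <= max_price]
    def cos(vec):
        return float(np.dot(photo_vec, vec) /
                     (np.linalg.norm(photo_vec) * np.linalg.norm(vec)))
    return sorted(items, key=lambda p: cos(p[2]), reverse=True)[:k]

# Toy photo embedding, made identical to the floral dress's vector so the
# "closest match" behavior is visible; a real photo would only land nearby.
photo = np.array([0.8, 0.3, 0.5])
matches = nearest(photo, max_price=50.0)  # "under $50, please"
print([m[0] for m in matches])
```

Note how the $79 dress drops out of the results entirely: the price cap from the conversation constrains the candidate set before the visual ranking is applied.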
Add Hyper-Personalization To The Mix
Now imagine the scenario above with the same multimodal conversational shopping assistant trained on the shopping interests of that specific client: one that understands their personal shopping preferences AND responds to them in a personalized, conversational manner.
That will be the subject of our next post – Bringing It All Together. Adding Hyper-Personalization Into Conversational Shopping.
visualAI retail solutions delivers AI-based e-commerce products that fundamentally change the shopping experience. shopperGPT is the platform’s hyper-personalized, multimodal conversational shopping agent.