MS in Architectural Technologies Thesis, SUMMER 2025
Links: DEMO PROJECT








WHAT IS VIBE3D?



VIBE3D is a gesture- and voice-driven agentic plugin for Blender that reframes how designers interact with 3D digital tools. It merges natural hand movements and spoken language with agentic and generative AI to create a fluid, continuous loop between thought, action, and form.

Transforms modeling from manual labor into an intuitive, expressive experience.

Is product oriented, yet generalizable to any requirement.

You can quickly test and tweak designs just by saying what you want: VIBE3D listens, understands, and adapts.



    A DIFFERENT APPROACH TO 3D




    3D modeling is often one of the most resource-intensive tasks in creative endeavors, yet:

    Starting in January 2025, 3D-capable generative models have stormed the market.
    From low resolution to extreme fidelity, combined with retopology and multi-view texture generation, any idea can be quickly executed, edited, and deployed. What took days to create now takes minutes.
    The creative industry is still missing an interface to actually use these models.


    A RESPONSIVE INTERFACE


    VIBE3D gives you the ability to create 3D models that become production ready through control and specification. Instead of toiling over a given mesh, you specify the output naturally with VIBE3D: you speak and gesture your model into existence; you conjure it.

    Traditional 3D modeling depends on keyboards and precise mouse work, creating barriers for many users. VIBE3D replaces this with gesture- and voice-based control, making design intuitive and playful.

    The UI intentionally displays as little information as possible, letting agents extrapolate context while the user's input and attention remain minimal.

    Users can “conjure” models through natural interaction, focusing on ideas rather than technical mechanics. Immediate visual feedback, AI refinement, and a conversational interface turn modeling into an exploratory, game-like experience, lowering learning curves and emphasizing creativity over tool mastery.






    VOICE AND PROMPTING



    The backend agent listens for voice commands and prompts while continuously capturing the current mesh state and the user's conversation.
    It interprets gestures and spoken prompts in real time, matching the evolving 3D geometry with the intended aesthetic.
    The model is fine-tuned to generate high-quality FLUX (image generation) prompts, whose outputs can then be fed into image-to-3D generation models.
    These outputs are highly precise and structured, with controlled variables (such as setting and lighting), and are automatically formatted as strings following the specified JSON schema.

    {
      "subject": {
        "name": "stackable chair",
        "form_keywords": ["curved shell", "mono-block", "faceted"],
        "material_keywords": ["brushed aluminum", "matte ABS", "polished steel"],
        "color_keywords": ["graphite gray", "satin sand", "electric blue"]
      }
    }


    This ensures the system translates intent into actionable design parameters for shape, material, and color, maintaining clarity and consistency throughout the generative workflow.
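
    As a rough illustration, a structured output like the one above could be parsed and flattened into a single image-generation prompt. The field names follow the schema shown here; the build_flux_prompt helper, its phrasing template, and the default setting and lighting values are illustrative assumptions, not the actual VIBE3D implementation.

    import json

    # Hypothetical helper: turn the agent's JSON string (following the schema
    # above) into one descriptive prompt for an image-generation model.
    # The phrasing template and default setting/lighting are assumptions.
    def build_flux_prompt(agent_output: str,
                          setting: str = "neutral studio backdrop",
                          lighting: str = "soft diffuse lighting") -> str:
        data = json.loads(agent_output)  # raises ValueError if the string is malformed
        subject = data["subject"]
        return (
            f"A {subject['name']}, "
            f"{', '.join(subject['form_keywords'])}, "
            f"made of {', '.join(subject['material_keywords'])}, "
            f"in {', '.join(subject['color_keywords'])}, "
            f"{setting}, {lighting}"
        )

    # Example, using the schema instance shown above.
    example = json.dumps({"subject": {
        "name": "stackable chair",
        "form_keywords": ["curved shell", "mono-block", "faceted"],
        "material_keywords": ["brushed aluminum", "matte ABS", "polished steel"],
        "color_keywords": ["graphite gray", "satin sand", "electric blue"],
    }})
    print(build_flux_prompt(example))

    Keeping setting and lighting as explicit parameters mirrors the controlled variables described above, so every generated image stays comparable across iterations.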


    GESTURE AS SPATIAL SPECIFIER



    VIBE3D uses machine vision to translate micro-gestures, subtle and intuitive hand movements, into actionable 3D operations. Users can shape, pull, and refine geometry almost effortlessly using sculpting brushes such as Pinch, Grab, Smooth, Inflate, and Flatten, allowing for the laziest form of modeling without interrupting creative flow.

    This opens the possibility for new plastic exploration: even if pointers provide precision, they collapse the full three-dimensional intentionality of the hands into a single point.

    Gestures also offer immediate embodied spatial feedback. As the user moves their hands, the mesh reacts in real time, making it easy to understand proportion, depth, and form dynamically. This feedback loop turns abstract ideas into tangible forms, supporting exploration, iteration, and experimentation in a way that keyboard and mouse interfaces cannot.
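
    To make the gesture layer concrete, here is a minimal sketch, assuming MediaPipe's standard 21-landmark hand topology, of how coarse gestures could be classified and mapped onto Blender's sculpt brushes. The distance thresholds and the gesture-to-brush table are illustrative assumptions rather than VIBE3D's actual mapping.

    import math

    # MediaPipe hand landmark indices (standard 21-point hand topology).
    THUMB_TIP, INDEX_TIP, MIDDLE_TIP = 4, 8, 12

    # Illustrative gesture-to-brush table; the brush names exist in Blender's
    # sculpt mode, but this particular mapping is an assumption.
    GESTURE_TO_BRUSH = {
        "pinch": "Pinch",
        "grab": "Grab",
        "open_palm": "Smooth",
    }

    def distance(a, b):
        """Euclidean distance between two normalized landmarks."""
        return math.sqrt((a.x - b.x) ** 2 + (a.y - b.y) ** 2 + (a.z - b.z) ** 2)

    def classify_gesture(landmarks, pinch_threshold=0.05):
        """Reduce a 21-landmark hand into a coarse gesture label."""
        thumb, index, middle = landmarks[THUMB_TIP], landmarks[INDEX_TIP], landmarks[MIDDLE_TIP]
        if distance(thumb, index) < pinch_threshold:
            return "pinch"
        if distance(thumb, middle) < pinch_threshold:
            return "grab"
        return "open_palm"

    def brush_for(landmarks):
        """Pick the sculpt brush that the detected gesture should drive."""
        return GESTURE_TO_BRUSH[classify_gesture(landmarks)]

    Continuous properties of the same landmarks, such as the spread between fingertips, could then modulate brush radius or strength in real time.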





    PRECEDENTS




    The starting point for this research was the intention to develop something around the idea of gesture by building on Google's MediaPipe library. This machine vision technology allowed me to quickly map webcam finger input inside my Blender environment and build from it.
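
    That first step can be pictured with a short sketch, assuming a standard webcam and the MediaPipe Hands solution: the loop reads frames and prints the normalized index-fingertip coordinates, which in the actual experiments were forwarded into Blender.

    import cv2
    import mediapipe as mp

    # Minimal webcam loop: track one hand with MediaPipe and read the index
    # fingertip position in normalized image coordinates.
    hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.5)
    capture = cv2.VideoCapture(0)

    INDEX_TIP = 8  # MediaPipe's index-fingertip landmark

    while capture.isOpened():
        ok, frame = capture.read()
        if not ok:
            break
        # MediaPipe expects RGB frames; OpenCV delivers BGR.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            tip = results.multi_hand_landmarks[0].landmark[INDEX_TIP]
            print(f"index tip: x={tip.x:.3f}  y={tip.y:.3f}  z={tip.z:.3f}")
        cv2.imshow("hand tracking", frame)
        if cv2.waitKey(1) & 0xFF == 27:  # press Esc to quit
            break

    capture.release()
    cv2.destroyAllWindows()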

    I started experimenting with different inputs, using both my whole body and my fingertips over a context model to generate "strips" of movement. These indexical experiments were a success, as they produced unexpected outputs that nonetheless felt intuitive.

    From my experience creating AltaSea Robot Lab, I understood the power of AI agents and language to embed design intention into 2D representations, and with the explosion of image-to-3D models, I quickly recognized this as a pathway forward.