AI video models are getting incredibly realistic. In theory, generating a short scene like “a dog cooking pancakes at a street stall” should be simple.
In practice, it was anything but simple.
I wanted to create a short, slightly absurd video: my Yorkshire Terrier running a pancake stall while I’m away from home. The dog cleans the griddle, pours batter, cracks an egg, adds sausage, and hands a finished pancake to a customer.
It sounded straightforward. But after running 13 generations across multiple AI video models, I quickly discovered something surprising:
AI is still very bad at basic cooking logic.
Instead of a clean pancake-making sequence, the results included customers taking over the kitchen, eggs appearing from the batter bowl, and packaging bags materializing out of nowhere.
So I decided to treat the whole process as a small experiment.
The Experiment Setup
To test how different models handled the same idea, I generated the scene using several AI video models available on PicLumen, including Kling 3.0 and Wan 2.6.
In total, I ran 13 generations with slightly different prompts.
Since many of these tools are developed by Chinese companies, I also tested both English and Chinese prompts to see if language would improve the model’s understanding of the scene.
Interestingly, switching languages didn’t significantly improve the results. The models behaved roughly the same regardless of prompt language.
What the Models Did Well
Among the models I tested, Kling 3.0 produced the most convincing results visually.
The lighting, camera movement, and environment looked noticeably more realistic than the other models. Even more interesting was how the model handled background characters.
Unlike other video generators, where pedestrians simply walk in the background, Kling sometimes gave NPCs their own logic. People passing by the food stall occasionally slowed down, looked at the stand, or spoke to the vendor.
It felt closer to a living street scene.
Other models tended to treat background characters more like moving decorations.
Wan 2.6, on the other hand, had a less flattering quirk: food elements looked noticeably artificial. Eggs in particular often had a strong “AI texture” that immediately broke the realism.
The Problems That Kept Appearing
Across almost all models, three problems kept repeating.
The first issue involved the human customer. Instead of behaving like a customer, the person often started acting like the boss of the stall. In several generations, the customer began adding sausage or brushing sauce onto the pancake themselves. It was as if the AI decided cooking should be a collaborative activity.
The second problem was ingredient logic. Eggs, chili oil, and sausage frequently appeared from the batter bowl rather than from their own containers. The model clearly struggled with the idea that different ingredients should come from different places.
The third issue was object spawning. Packaging bags often appeared suddenly at the end of the sequence without ever being shown earlier in the scene.
Individually, these glitches are funny. Together they show how difficult it still is for AI video models to maintain physical consistency across a sequence of actions.
Adjusting the Prompt
To reduce these errors, I simplified and restructured the prompt.
The original prompt contained a long chain of dialogue and actions: the dog cleaning the griddle, talking to the customer, cracking eggs, asking about spice level, brushing chili oil, adding sausage, folding the pancake, packaging it, and handing it over.
That many steps turned out to be difficult for the model to track.
The revised prompt focused on three things:
clearly defining where each ingredient came from
simplifying the action sequence
restricting what the human customer could do
For example, instead of simply mentioning eggs or chili oil, the revised prompt explicitly described an egg tray, a chili oil bottle, and a sausage bowl on the counter. This small detail significantly reduced the number of ingredients appearing in the wrong place.
I also added a constraint stating that the customer only receives the food and does not interact with the cooking tools. While this did not completely eliminate the issue, it improved the results noticeably.
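To make the structure concrete, here is a minimal sketch in Python of how a prompt like the revised one can be assembled. The object names, action list, and exact wording are illustrative assumptions, not the literal prompt I used; the point is the shape: object sources first, then a short action chain, then explicit constraints on the other characters.

```python
# A minimal sketch (illustrative wording, not my literal prompt) showing the
# structure of the revised prompt: object sources first, then a short action
# chain, then explicit constraints on the human customer.

SCENE = "A Yorkshire Terrier runs a small pancake stall on a street corner."

# Name where every ingredient lives BEFORE any action uses it.
OBJECTS = [
    "an egg tray on the left of the counter",
    "a chili oil bottle next to the griddle",
    "a sausage bowl on the right of the counter",
]

# Keep a single shot to two or three clear actions.
ACTIONS = [
    "the dog pours batter onto the griddle",
    "the dog cracks one egg from the egg tray onto the batter",
    "the dog hands the finished pancake to the customer",
]

# Explicitly limit what the human customer may do.
CONSTRAINTS = [
    "the customer only waits and receives the food",
    "the customer never touches the cooking tools or ingredients",
]

prompt = " ".join([
    SCENE,
    "On the counter: " + "; ".join(OBJECTS) + ".",
    "Actions, in order: " + "; ".join(ACTIONS) + ".",
    "Constraints: " + "; ".join(CONSTRAINTS) + ".",
])

print(prompt)
```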
What Actually Helped (and What Didn’t)
After multiple generations, a few patterns became clear. Some prompt adjustments made a meaningful difference, while others had almost no impact.
| Prompt Adjustment | Purpose | Result |
|---|---|---|
| Specify ingredient containers (egg tray, sausage bowl, chili oil bottle) | Prevent ingredients from appearing in the batter bowl | Very effective |
| Simplify the cooking sequence into clear steps | Reduce action confusion | Moderately effective |
| Add behavior limits for the customer | Prevent the customer from cooking | Partially effective |
| Switch from English prompts to Chinese prompts | Test language comprehension | No significant change |
| Add more dialogue between characters | Improve realism | Often made results worse |
| Describe objects earlier in the scene | Reduce object spawning | Helpful in some cases |
The most useful change turned out to be the simplest one: explicitly describing where objects are located in the scene.
What This Experiment Taught Me
The goal of this experiment wasn’t just to make a funny video of a dog cooking pancakes. It also revealed something interesting about how current AI video models interpret scenes.
Visually, most models are already very strong. Lighting, environments, and character movement can look surprisingly cinematic. In several generations, the stall environment even felt believable at a glance.
Where things start to break down is the action logic.
Cooking scenes are a surprisingly difficult test for AI because they involve multiple objects interacting in a specific sequence. Batter is poured, eggs are cracked, sauce is brushed, ingredients are added, and the final food is wrapped and handed to a customer. Humans process this chain of actions effortlessly, but AI models often struggle to keep track of it.
From my earlier, more successful video experiments, I’ve noticed a pattern: when a single shot contains too many consecutive actions, the model tends to lose track of what is happening. The result often looks strangely chaotic, as if the characters suddenly “forgot” what they were doing.
In the pancake scene, this is exactly what happened. When the prompt described a long sequence of actions within one shot, ingredients began appearing from the wrong containers, customers started interfering with the cooking process, and objects appeared out of nowhere.
In practice, a shot works much better when it contains only two or three clear actions. Once the sequence becomes longer than that, the model often becomes confused about object relationships and physical continuity.
Breaking a complex task into several simpler shots usually produces much more stable results.
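If you want to apply this rule systematically, the idea reduces to chunking a long action list into shots of at most two or three actions each. Here is a hedged Python sketch of that chunking; the three-action cap is my rough working limit, not a documented model constraint.

```python
# Illustrative sketch: split a long action chain into shots of at most three
# actions each, since longer single-shot sequences tended to break down.

actions = [
    "clean the griddle",
    "pour batter",
    "crack an egg from the egg tray",
    "brush chili oil from the bottle",
    "add sausage from the sausage bowl",
    "fold the pancake",
    "put it in a packaging bag",
    "hand it to the customer",
]

MAX_ACTIONS_PER_SHOT = 3  # my rule of thumb, not a documented model limit

shots = [
    actions[i : i + MAX_ACTIONS_PER_SHOT]
    for i in range(0, len(actions), MAX_ACTIONS_PER_SHOT)
]

for n, shot in enumerate(shots, start=1):
    print(f"Shot {n}: " + ", then ".join(shot))
```

In practice I would reuse the same scene description and constraints across every chunk, so the individual shots stay visually consistent when stitched together.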
Ironically, the dog chef wasn’t the hardest part of the scene.
The hardest part was simply making the cooking process follow basic kitchen logic.
Final Thoughts
My Yorkshire Terrier never quite became the perfect pancake chef I imagined.
But the process of trying revealed something more interesting: generating believable everyday actions is still one of the hardest challenges for AI video models.
The good news is that small prompt adjustments—especially clarifying object locations and simplifying actions—can significantly improve results.
Sometimes the most useful tutorials don’t come from perfect results.
They come from the experiments that went a little wrong.