
I Tried 13 AI Video Generators to Make My Dog Cook Pancakes. Here’s What Actually Happened.

Updated: Mar 18, 2026

AI video models are getting incredibly realistic. In theory, generating a short scene like “a dog cooking pancakes at a street stall” should be simple.

In practice, it turned into a small experiment.

I wanted to create a short, slightly absurd video: my Yorkshire Terrier running a pancake stall while I’m away from home. The dog cleans the griddle, pours batter, cracks an egg, adds sausage, and hands a finished pancake to a customer.

It sounded straightforward. But after running 13 generations across multiple AI video models, I quickly discovered something surprising:
AI is still very bad at basic cooking logic.

Instead of a clean pancake-making sequence, the results included customers taking over the kitchen, eggs appearing from the batter bowl, and packaging bags materializing out of nowhere.

So I decided to treat the whole process as a small experiment.

The Experiment Setup

To test how different models handled the same idea, I generated the scene using several AI video models available on PicLumen.

The models I tested included Kling 3.0 and Wan 2.6, among others available on the platform.

In total, I ran 13 generations with slightly different prompts.

Since many of these tools are developed by Chinese companies, I also tested both English and Chinese prompts to see if language would improve the model’s understanding of the scene.

Interestingly, switching languages didn’t significantly improve the results. The models behaved roughly the same regardless of prompt language.

What the Models Did Well

Among the models I tested, Kling 3.0 produced the most convincing results visually.

The lighting, camera movement, and environment looked noticeably more realistic than the other models. Even more interesting was how the model handled background characters.

Unlike other video generators, where pedestrians simply walk in the background, Kling sometimes gave NPCs their own logic. People passing by the food stall occasionally slowed down, looked at the stand, or spoke to the vendor.

It felt closer to a living street scene.

Other models tended to treat background characters more like moving decorations.

Wan 2.6, by contrast, had a notable weakness: food elements looked artificial. Eggs in particular often had a strong "AI texture" that immediately broke the realism.

The Problems That Kept Appearing

[Images: a Yorkshire Terrier sitting on the Chinese pancake stall while a man cooks a pancake; chili oil emerging from the batter; a food packaging bag appearing out of nowhere]

Across almost all models, three problems kept repeating.

The first issue involved the human customer. Instead of behaving like a customer, the person often started acting like the boss of the stall. In several generations, the customer began adding sausage or brushing sauce onto the pancake themselves. It was as if the AI decided cooking should be a collaborative activity.

The second problem was ingredient logic. Eggs, chili oil, and sausage frequently appeared from the batter bowl rather than from their own containers. The model clearly struggled with the idea that different ingredients should come from different places.

The third issue was object spawning. Packaging bags often appeared suddenly at the end of the sequence without ever being shown earlier in the scene.

Individually, these glitches are funny. Together they show how difficult it still is for AI video models to maintain physical consistency across a sequence of actions.

Adjusting the Prompt

To reduce these errors, I simplified and restructured the prompt.

The original prompt contained a long chain of dialogue and actions: the dog cleaning the griddle, talking to the customer, cracking eggs, asking about spice level, brushing chili oil, adding sausage, folding the pancake, packaging it, and handing it over.

That many steps turned out to be difficult for the model to track.

The revised prompt focused on three things:

  1. clearly defining where each ingredient came from

  2. simplifying the action sequence

  3. restricting what the human customer could do

For example, instead of simply mentioning eggs or chili oil, the revised prompt explicitly described an egg tray, a chili oil bottle, and a sausage bowl on the counter. This small detail significantly reduced the number of ingredients appearing in the wrong place.

I also added a constraint stating that the customer only receives the food and does not interact with the cooking tools. While this did not completely eliminate the issue, it improved the results noticeably.
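The three fixes can be sketched as a small prompt-builder. This is a purely illustrative Python helper (the function and variable names are my own, not part of PicLumen or any model's API): it names every container up front, keeps the action list short, and ends with an explicit behavior constraint.

```python
# Illustrative sketch of the restructured prompt: scene objects first,
# a short action list second, behavior constraints last.
# All names here are hypothetical, not a real video-generation API.

def build_prompt(scene_objects, actions, constraints):
    """Compose a prompt that declares every container before any action."""
    setup = "On the counter: " + ", ".join(scene_objects) + "."
    steps = " ".join(actions)          # keep this list short (2-3 actions)
    limits = " ".join(constraints)     # explicit behavior restrictions
    return " ".join([setup, steps, limits])

prompt = build_prompt(
    scene_objects=["an egg tray", "a chili oil bottle", "a bowl of diced sausage"],
    actions=[
        "The dog scoops batter from the batter bowl and pours it onto the griddle.",
        "The dog takes one egg from the egg tray and cracks it onto the pancake.",
    ],
    constraints=[
        "The customer stands in front of the stall, only receives the food, "
        "and never touches the cooking tools."
    ],
)
print(prompt)
```

The ordering is the point: by the time the model reads an action like "takes one egg from the egg tray," the tray already exists in the scene, so the egg has somewhere to come from other than the batter bowl.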

Previous Prompt

New Prompt

The dog tidies up the workspace and uses a clean cloth to wipe down the griddle. At this moment, a human customer enters the frame and calls out: "Hey, Chef! Let me get a pancake—with sausage!" the dog scoops up a ladleful of batter and pours it onto the griddle, then uses the spatula to spread the batter evenly in a clockwise motion, covering the entire surface of the griddle. Meanwhile, it asks the customer: "Would you like an egg?" The customer replies: "Add one, thanks. Hurry it up, Boss—I'm running late for work!" The dog takes an egg, cracks the shell, and pours the contents onto the pancake; while performing this task, it asks: "Would you like it spicy?" The customer answers: "Just a little chili oil, please." The dog dips a brush into the chili oil, applying a small amount to the pancake. Next, it sprinkles on some diced sausage, uses the spatula to fold and roll up the pancake, places it into a paper bag, hands it to the customer, and says: "Here is your pancake. Enjoy! Come visit us again soon!"
A human customer stands quietly in front of the stall and says: “Hey chef, one pancake with sausage please.” 
The dog scoops batter from the batter bowl using a ladle and pours it onto the hot griddle. It uses a spatula to gently spread the batter into a round pancake. 
The dog then takes one egg from the egg tray, cracks it carefully, and pours the egg onto the pancake. 
Next, the dog takes a brush and lightly dips it into the chili oil bottle, brushing a small amount of chili oil on the pancake. 
Then the dog grabs diced sausage from the sausage bowl and sprinkles it evenly onto the pancake.

What Actually Helped (and What Didn’t)

After multiple generations, a few patterns became clear. Some prompt adjustments made a meaningful difference, while others had almost no impact.

| Prompt Adjustment | Purpose | Result |
| --- | --- | --- |
| Specify ingredient containers (egg tray, sausage bowl, chili oil bottle) | Prevent ingredients from appearing in the batter bowl | Very effective |
| Simplify the cooking sequence into clear steps | Reduce action confusion | Moderately effective |
| Add behavior limits for the customer | Prevent the customer from cooking | Partially effective |
| Switch from English prompts to Chinese prompts | Test language comprehension | No significant change |
| Add more dialogue between characters | Improve realism | Often made results worse |
| Describe objects earlier in the scene | Reduce object spawning | Helpful in some cases |

The most useful change turned out to be the simplest one: explicitly describing where objects are located in the scene.

What This Experiment Taught Me

The goal of this experiment wasn’t just to make a funny video of a dog cooking pancakes. It also revealed something interesting about how current AI video models interpret scenes.

Visually, most models are already very strong. Lighting, environments, and character movement can look surprisingly cinematic. In several generations, the stall environment even felt believable at a glance.

Where things start to break down is the action logic.

Cooking scenes are a surprisingly difficult test for AI because they involve multiple objects interacting in a specific sequence. Batter is poured, eggs are cracked, sauce is brushed, ingredients are added, and the final food is wrapped and handed to a customer. Humans process this chain of actions effortlessly, but AI models often struggle to keep track of it.

From previous video experiments, I've noticed a pattern: when a single shot contains too many consecutive actions, the model tends to lose track of what is happening. The result often looks strangely chaotic—as if the characters suddenly "forgot" what they were doing.

In the pancake scene, this is exactly what happened. When the prompt described a long sequence of actions within one shot, ingredients began appearing from the wrong containers, customers started interfering with the cooking process, and objects appeared out of nowhere.

In practice, a shot works much better when it contains only two or three clear actions. Once the sequence becomes longer than that, the model often becomes confused about object relationships and physical continuity.

Breaking a complex task into several simpler shots usually produces much more stable results.
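That rule of thumb is easy to express in code. The helper below is purely illustrative (the names are mine): it chunks a long action list into shots of at most three actions each, which could then be generated one shot at a time and stitched together.

```python
# Sketch of the "2-3 actions per shot" rule: split a long cooking
# sequence into small consecutive shots. Names are hypothetical.

def split_into_shots(actions, max_actions=3):
    """Group a flat action list into consecutive shots of limited length."""
    return [actions[i:i + max_actions]
            for i in range(0, len(actions), max_actions)]

sequence = ["wipe the griddle", "pour batter", "spread batter",
            "crack an egg", "brush chili oil", "sprinkle sausage",
            "fold the pancake", "hand it to the customer"]

shots = split_into_shots(sequence)
print(len(shots))  # 8 actions -> 3 shots
```

Each shot then gets its own prompt, so the model never has to track more than three object interactions at once.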

Ironically, the dog chef wasn’t the hardest part of the scene.

The hardest part was simply making the cooking process follow basic kitchen logic.

Final Thoughts

My Yorkshire Terrier never quite became the perfect pancake chef I imagined.

But the process of trying revealed something more interesting: generating believable everyday actions is still one of the hardest challenges for AI video models.

The good news is that small prompt adjustments—especially clarifying object locations and simplifying actions—can significantly improve results.

Sometimes the most useful tutorials don’t come from perfect results.

They come from the experiments that went a little wrong.

Jessie