
What "computer vision in your kitchen" actually means

Sarah Carter · ML Engineer, CheffyIQ
22 April 2026 · 6 min read

Most operators we talk to have one of two reactions to "AI cameras". Either it's magic ("So it just… knows everything?") or it's nothing ("Cameras have been around for 30 years."). Both are wrong. Here's what's really happening behind the screen.

The basic loop: see, label, decide

Modern computer vision systems do three things, every fraction of a second:

1. See: grab the latest frame from the camera.
2. Label: find the objects in that frame, each with a name and a confidence score.
3. Decide: run plain if/then rules over those labels (bare hand near food? Flag it).

The first two steps are what people call "AI". The third step is the same boring rules engine you'd write in any software. The magic is in step two.
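Here's a minimal sketch of that loop in Python, with the camera and the model stubbed out so the whole shape fits on one screen. None of this is our real code; the stubs just hand back canned values.

def see():
    # 1. See: in production, grab the next frame from the camera.
    return "frame"

def label(frame):
    # 2. Label: in production, run the detection model on the frame.
    #    Here we return canned (name, confidence) pairs.
    return [("chef_hat", 0.94), ("hand_bare", 0.88), ("food_dish", 0.92)]

def decide(detections):
    # 3. Decide: ordinary if/then rules. No ML in this step at all.
    names = {name for name, conf in detections if conf >= 0.80}
    if "hand_bare" in names and "food_dish" in names:
        return ["violation:no_glove"]
    return []

print(decide(label(see())))  # -> ['violation:no_glove']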

What does a kitchen camera actually "see"?

Here's a real frame from one of our edge boxes. Each colored box is one detection, with a label and a confidence score:

person 0.97  [412, 218, 580, 712]
chef_hat 0.94  [440, 218, 540, 280]
hand_bare 0.88  [510, 510, 580, 590]
food_dish 0.92  [470, 580, 690, 720]
violation:no_glove 0.85

That's it. Numbers. The "intelligence" is the model knowing that those particular pixels look more like a bare hand than a gloved hand — not because someone told it, but because it has seen 1.4 million hand images during training.
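Those numbers are all the rules engine needs. Here's a hedged sketch of how the decide step could turn that exact frame into the violation:no_glove flag, this time checking that the bare-hand box actually overlaps the dish box. The 0.80 cutoff and the overlap rule are illustrative, not our production logic.

def overlap(a, b):
    # True if two (x1, y1, x2, y2) boxes intersect at all.
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2

detections = [
    ("person",    0.97, (412, 218, 580, 712)),
    ("chef_hat",  0.94, (440, 218, 540, 280)),
    ("hand_bare", 0.88, (510, 510, 580, 590)),
    ("food_dish", 0.92, (470, 580, 690, 720)),
]

hands  = [box for name, conf, box in detections if name == "hand_bare" and conf >= 0.80]
dishes = [box for name, conf, box in detections if name == "food_dish" and conf >= 0.80]

if any(overlap(h, d) for h in hands for d in dishes):
    print("violation:no_glove")  # a bare hand over a plated dish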

YOLO, in 60 seconds

YOLO ("You Only Look Once") is the family of models that do object detection fast enough for real-time video. The trick: instead of scanning the image with a sliding window (slow), it predicts all boxes in one forward pass through the network (fast).

For us, fast matters. We have to look at every frame from every camera, every shift. A model that takes 800ms per frame is useless. A model that takes 18ms per frame can watch 4 cameras at 14 fps from a single Jetson edge box.
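If you want to feel the speed for yourself, here's a minimal timing sketch using the open-source ultralytics package and a small off-the-shelf checkpoint. Our production models and pipeline are different; this only demonstrates the one-forward-pass idea.

import time
from ultralytics import YOLO

model = YOLO("yolov8n.pt")    # small public checkpoint, not our kitchen model
frame = "kitchen_frame.jpg"   # any image file; in production, a video frame

start = time.perf_counter()
results = model(frame)        # one forward pass -> every box at once
print(f"{(time.perf_counter() - start) * 1000:.0f} ms for this frame")

for box in results[0].boxes:  # name, confidence, (x1, y1, x2, y2)
    print(results[0].names[int(box.cls)], float(box.conf), box.xyxy[0].tolist())

The budget arithmetic is straightforward: at 18ms per frame, one box does about 55 inferences a second, and 55 split across 4 cameras is roughly 14 fps each.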

"The best AI in the kitchen isn't the one with the most parameters. It's the one with the most training data from real kitchens."

Why "we trained it on kitchen footage" matters

Generic models are trained on internet photos — bowls, ovens, professional food photography. They fall apart in real kitchens, where steam fogs the lens, lighting swings between harsh and dim, hands move too fast to freeze cleanly, and every surface is cluttered with overlapping pans, tools, and half-plated dishes.

We've labeled 380,000 frames of real restaurant kitchen footage to teach our models what these things actually look like. That's the moat — not the algorithm.

"Embeddings" — how we tell two briskets apart

Object detection tells you "this is brisket". But you also need to know: is this brisket plated correctly? That's where embeddings come in.

Think of an embedding as a list of 256 numbers that captures the "essence" of a dish — rice ratio, garnish placement, color balance, portion size. Two briskets that look identical produce nearly identical embeddings. Two that are different produce numbers that are far apart in this 256-dimensional space.

So when a chef plates today's brisket, we compute the embedding, compare it to the reference dish, and report a "match score" of 91%. No manual rules. The model just… knows what brisket-shaped looks like.
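In code, that comparison can be as small as a cosine similarity between two vectors. A minimal sketch, assuming cosine similarity as the metric (our actual model and scoring differ) and using random stand-in vectors:

import numpy as np

def match_score(embedding, reference):
    # Cosine similarity between two 256-d vectors, scaled to a percentage.
    cos = np.dot(embedding, reference) / (
        np.linalg.norm(embedding) * np.linalg.norm(reference)
    )
    return float(cos) * 100

rng = np.random.default_rng(0)
reference = rng.normal(size=256)                 # the approved plating
todays = reference + 0.1 * rng.normal(size=256)  # today's plate, slightly off

print(f"match score: {match_score(todays, reference):.0f}%")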

What it can't do (yet)

Computer vision is not omniscient. Things our system regularly gets wrong, and how we handle them:

The takeaway for operators

You don't need to know how a transmission works to drive a car. But knowing that AI is "see, label, decide" — trained on lots of examples — helps you ask better questions when evaluating systems:

Was the model trained on real kitchen footage, or on generic internet photos?
Where does inference run, and how many cameras can one box actually keep up with?
What happens when the model isn't confident: an alert, silence, or a flag for human review?

Sarah Carter
ML Engineer at CheffyIQ. Previously at Uber and Niramai. Spends weekends labeling food.

Related posts

Edge AI vs cloud AI: which is right for your kitchen?

Latency, privacy, cost. Three reasons we put inference on a Jetson box.

How Hearth & Stone cut hygiene violations by 78% in 60 days

Behind-the-scenes of an 18-outlet AI rollout.

See it on your menu

Send us 5 dishes. We'll show you what our model sees.