Concepts

How Food Embeddings Work

What are embeddings?

An embedding is a numeric vector (a list of numbers) that captures the meaning of a piece of text. Similar meanings produce similar vectors. You can measure how similar two items are by comparing their vectors using cosine similarity, which ranges from -1 to 1. As a rough guide to reading scores:

  • 1.0 = identical meaning
  • 0.8+ = very similar (likely the same dish)
  • 0.5-0.7 = related (same category or cuisine)
  • Below 0.3 = unrelated
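To make the scale above concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made up for illustration and are not real dish-embed output; real embeddings have 128 to 384 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors, invented for this example (not produced by any model)
biryani = [0.9, 0.1, 0.3]
murgh_biryani = [0.8, 0.2, 0.3]
brownie = [0.0, 1.0, 0.1]

# Vectors pointing in nearly the same direction score close to 1.0
print(round(cosine_similarity(biryani, murgh_biryani), 3))
# Vectors pointing in different directions score much lower
print(round(cosine_similarity(biryani, brownie), 3))
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing embeddings.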

Why food needs specialized embeddings

General-purpose embedding models (the kind you'd use for document search or chatbot retrieval) fail on food data in specific ways:

Transliteration blindness

"Murgh" is Hindi for chicken. "Murgh Makhani" and "Butter Chicken" are the same dish. Generic models treat "Murgh" as an unknown token and produce low similarity scores. dish-embed maps transliterations correctly across languages and scripts.

Noise sensitivity

Real menu data looks like this:

**NEW** 50% OFF Chicken Biryani (Serves 2) [Non-Veg]

A generic model embeds all that noise as part of the meaning. dish-embed strips it before embedding, so this matches "Chicken Biryani" with high confidence.
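The kind of cleanup described above could look roughly like the sketch below. The patterns and the `strip_menu_noise` name are hypothetical, chosen only to cover the sample line; they are not dish-embed's actual rules, which this page does not document.

```python
import re

# Hypothetical patterns that cover the sample menu line; not dish-embed's real rules
NOISE_PATTERNS = [
    r"\*\*[^*]+\*\*",      # badges such as **NEW**
    r"\b\d+\s*%\s*off\b",  # discounts such as 50% OFF
    r"\([^)]*\)",          # parenthetical notes such as (Serves 2)
    r"\[[^\]]*\]",         # bracketed tags such as [Non-Veg]
]

def strip_menu_noise(raw):
    """Remove promotional and packaging noise, then collapse whitespace."""
    text = raw
    for pattern in NOISE_PATTERNS:
        text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    return " ".join(text.split())

print(strip_menu_noise("**NEW** 50% OFF Chicken Biryani (Serves 2) [Non-Veg]"))
```

Because dish-embed does this step for you, a sketch like this is only useful for understanding what gets removed or for pre-cleaning data sent to other systems.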

Cross-lingual understanding

"Pollo Asado" (Spanish), "Grilled Chicken" (English), "Murgh Tandoori" (Hindi) are all grilled chicken preparations. dish-embed produces similar embeddings for them across 100+ languages.

Dietary signal preservation

Generic models don't know that "Paneer Tikka" is vegetarian and "Chicken Tikka" is not. They see high text overlap and produce high similarity. dish-embed understands that protein differences change the fundamental nature of a dish.

What dish-embed knows

dish-embed has food-specific knowledge baked in:

  • Which items are the same dish under different names
  • Which items are related but distinct (Butter Chicken vs Dal Makhani)
  • Cross-lingual equivalences across 100+ languages
  • Cuisine and category relationships across Indian, East Asian, Southeast Asian, Middle Eastern, European, Latin American, and American cuisines

Using embeddings directly

If you want to store embeddings in your own vector database for custom search or clustering:

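import requests

# BASE (API base URL) and headers (auth) are assumed to be defined during setup
resp = requests.post(f"{BASE}/embed", headers=headers,
    json={"items": ["Chicken Biryani", "Murgh Biryani", "Veg Pulao"], "dimension": 384})
resp.raise_for_status()  # fail fast on HTTP errors

embeddings = resp.json()["embeddings"]
# Each embedding is a list of 384 floats
# Store in Pinecone, Weaviate, pgvector, FAISS, etc.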
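Before wiring up a vector database, it can help to see what a custom search does with the stored vectors. The sketch below is a brute-force, in-memory stand-in: it normalizes each vector once at indexing time, so a query only needs a dot product per item. The `normalize` and `top_k` helpers and the toy three-dimensional vectors are assumptions for illustration; in practice the vectors would come from `resp.json()["embeddings"]`.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def top_k(query, index, k=2):
    """Return the k (name, score) pairs closest to query; index maps names to unit vectors."""
    q = normalize(query)
    scored = [(name, sum(a * b for a, b in zip(q, vec))) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors standing in for the API response (invented for this example)
items = ["Chicken Biryani", "Murgh Biryani", "Veg Pulao"]
vectors = [[0.9, 0.1, 0.3], [0.8, 0.2, 0.3], [0.2, 0.9, 0.4]]
index = {name: normalize(vec) for name, vec in zip(items, vectors)}

# Query with the "Chicken Biryani" vector itself: it ranks first with a score near 1.0
for name, score in top_k([0.9, 0.1, 0.3], index):
    print(f"{name}: {score:.3f}")
```

A dedicated vector store replaces the linear scan with an approximate index, but the ranking it returns follows the same idea.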

You can choose your embedding dimension (128, 256, or 384) depending on your quality and storage requirements. See Matryoshka Dimensions for trade-offs.
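For the storage side of that choice, a back-of-the-envelope estimate is often enough. The sketch below assumes vectors are stored as 32-bit floats and counts raw vector bytes only; index structures and metadata in a real vector database add overhead that is not modeled here.

```python
BYTES_PER_FLOAT32 = 4  # assumption: each component is stored as a 32-bit float

def storage_bytes(num_items, dimension):
    """Raw bytes needed to store num_items vectors of the given dimension."""
    return num_items * dimension * BYTES_PER_FLOAT32

for dim in (128, 256, 384):
    mib = storage_bytes(1_000_000, dim) / (1024 ** 2)
    print(f"{dim} dims: {mib:.0f} MiB per million items")
```

Raw storage grows linearly with dimension, so 128-dimensional vectors take one third of the space of 384-dimensional ones.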