When RAG retrieval quality is inconsistent, recall is poor, and rerankers end up compensating for weaknesses in the pipeline, I often hear people point out that they are using the same embedding model for both the documents and the queries.
The reason this is not enough is subtle but critical. Using the same model does not guarantee that document embeddings and query embeddings occupy the same semantic space in practice. Embedding models map text into a high-dimensional vector space, and retrieval works only when documents and queries are represented symmetrically, that is, when they encode comparable semantic intent. Any mismatch in how the text is structured, how long it is, how the query and the document are preprocessed, or what intent each represents can push the two apart. This is called retrieval asymmetry.
Below are some of the situations where you typically see this asymmetry, and the poor retrieval performance that comes with it, in RAG systems.
1 Instruction and Intent Asymmetry
While a document contains facts and descriptive content, a query is usually interrogative in nature. Even with the same model, the two can end up in different regions of the embedding space. For example, if the document is "The company started in 1986 with 30 employees and over the years the company strength increased with over 1000 employees now" and the query is "What was the employee count in 1986?", both are semantically related, but they play different roles: one is a statement, the other is a question.
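Some embedding model families are trained with this asymmetry in mind and expect a role hint, such as a "query:" or "passage:" prefix or a task instruction, on the input text. Below is a minimal sketch assuming an E5-style model served through sentence-transformers; the model name and prefixes are illustrative, so check your embedding model's documentation for its exact convention.

from sentence_transformers import SentenceTransformer, util

# Assumption: an E5-style model that was trained with "query: " / "passage: "
# prefixes. Swap in whatever embedding model your pipeline actually uses.
model = SentenceTransformer("intfloat/e5-base-v2")

document = (
    "The company started in 1986 with 30 employees and over the years "
    "the company strength increased with over 1000 employees now."
)
query = "What was the employee count in 1986?"

# Symmetric treatment: both sides embedded identically
plain_doc = model.encode(document)
plain_query = model.encode(query)

# Asymmetric treatment: role-specific prefixes the model expects
prefixed_doc = model.encode("passage: " + document)
prefixed_query = model.encode("query: " + query)

print("Similarity without prefixes:", util.cos_sim(plain_query, plain_doc).item())
print("Similarity with prefixes:   ", util.cos_sim(prefixed_query, prefixed_doc).item())

The point is not the absolute scores, which depend on the model, but that the query and the passage are embedded under the roles the model was trained for.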
2 Chunk Length and Semantic Density Mismatch
A common indexing practice is to embed large document chunks (800–1500 tokens) to capture broader context. Queries, however, are typically very short. This creates a fundamental imbalance: the document embedding encodes multiple topics simultaneously, while the query embedding encodes a single, narrow intent.
Smaller, intent‑aligned chunks typically retrieve better results.
import json

# Assumes the helper functions below are defined elsewhere in the script
# (for example against Azure AI Search or any other vector store):
# create_vector_index, upload_chunks, vector_search, chunk_long, chunk_short.

def main():
    long_index = "chunks-long"    # created using longer chunks
    short_index = "chunks-short"  # created using smaller, intent-aligned chunks

    # Create indexes
    create_vector_index(long_index)
    create_vector_index(short_index)

    # Source doc with multiple topics (this is intentional)
    doc_id = "doc-employee-policy"
    doc_text = (
        "This policy covers annual leave, sick leave, maternity benefits, travel reimbursements, "
        "and performance review cycles. \n\n"
        "Maternity leave: Employees are eligible for up to 26 weeks of maternity leave, "
        "including 8 weeks pre-delivery. \n\n"
        "Travel reimbursements: Only economy class travel is reimbursable unless approved. \n\n"
        "Performance reviews: Reviews happen twice a year with mid-year checkpoints."
    )

    # Chunking
    long_chunks = chunk_long(doc_text)
    short_chunks = chunk_short(doc_text, max_chars=220)

    # Upload
    upload_chunks(long_index, doc_id, long_chunks, "long")
    upload_chunks(short_index, doc_id, short_chunks, "short")

    # Query (plain, focused intent)
    query = "What is the maternity leave duration?"
    print("\n============================")
    print("QUERY:", query)
    print("============================\n")

    print("---- Results from LONG chunks ----")
    long_results = vector_search(long_index, query, k=5)
    print(json.dumps(long_results, indent=2))

    print("\n---- Results from SHORT chunks ----")
    short_results = vector_search(short_index, query, k=5)
    print(json.dumps(short_results, indent=2))

    print("Expected observation:")
    print("- SHORT chunks tend to return the exact maternity section.")
    print("- LONG chunks often pull broader policy chunks where maternity is just a small part (semantic dilution).")

if __name__ == "__main__":
    main()
Example
Document: This policy covers annual leave, sick leave, maternity benefits, insurance coverage, and termination procedures…
Query: What is the maternity leave policy?
3 Structural and Formatting Asymmetry
Documents often contain structure that queries do not: headers, bullet points, tables, lists, logs, or JSON fragments. If this structure is embedded in its original form, the model captures layout and syntactic cues in addition to semantics. Queries, on the other hand, are usually plain natural language. This mismatch can introduce noise into document embeddings, especially when chunk boundaries or formatting tokens dominate the representation.
def main():
    # Simulated embedding function (intentionally simplistic):
    # counts structural tokens to show how formatting changes vectors
    def embed(text):
        return {
            "words": len(text.split()),
            "pipes": text.count("|"),    # table structure
            "bullets": text.count("-"),  # bullet points
            "hashes": text.count("#"),   # markdown headers
        }

    # Document with heavy structure
    structured_document = """
    ## Leave Policy

    | Type        | Duration |
    |-------------|----------|
    | Maternity   | 26 weeks |
    | Sick Leave  | 12 days  |

    - Applies to full-time employees
    - Requires manager approval
    """

    # User query is plain text
    plain_text_query = "What is the maternity leave duration?"

    doc_embedding = embed(structured_document)
    query_embedding = embed(plain_text_query)

    print("Document embedding:", doc_embedding)
    print("Query embedding:   ", query_embedding)

    print("\nObservation:")
    print("- Same meaning")
    print("- Same embedding logic")
    print("- Different structure → different vector representation")
    print("- Leads to weak similarity in vector search")

if __name__ == "__main__":
    main()
4 Preprocessing Inconsistencies
Documents are often lower-cased, stripped of punctuation, and cleaned of stop words, whereas queries are usually raw, conversational user input. Even small preprocessing differences change token statistics, which can move vectors in subtle but harmful ways, particularly in hybrid (BM25 + vector) retrieval setups.
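A simple safeguard is to route documents and queries through one shared normalization function before they reach BM25 or the embedding model. The sketch below is illustrative; the normalize helper and the sample strings are assumptions, not a prescribed pipeline.

import re

def normalize(text: str) -> str:
    """Apply the SAME normalization to documents and queries (illustrative)."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)       # strip punctuation
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

document = "Annual Leave: 24 days/year (pro-rated for new joiners)."
raw_query = "Hey, how many days of annual leave do I get??"

# Cleaning only the documents makes token statistics diverge:
print("Doc (cleaned):  ", normalize(document))
print("Query (raw):    ", raw_query)

# Symmetric preprocessing keeps both sides comparable:
print("Query (cleaned):", normalize(raw_query))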
5 Metadata Included Only for the Documents
It’s common to embed documents along with metadata. Queries rarely include equivalent metadata fields. This causes document embeddings to partially represent metadata semantics, while query embeddings represent only user intent. In such cases, retrieval may favor documents with metadata overlap rather than true semantic relevance.
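One common mitigation is to embed only the content and keep metadata in separate filterable fields, rather than concatenating it into the text that gets embedded. The sketch below is a minimal illustration; the field names and the embed_fn placeholder are assumptions, not a specific vector store's schema.

def build_search_record(doc_id, content, metadata, embed_fn):
    """Embed only the content; store metadata as filterable fields (illustrative)."""
    return {
        "id": doc_id,
        "content": content,
        "contentVector": embed_fn(content),      # semantics only
        # Metadata is stored for filtering, not embedded, so it cannot
        # leak into the vector representation.
        "department": metadata.get("department"),
        "doc_type": metadata.get("doc_type"),
        "last_updated": metadata.get("last_updated"),
    }

# Anti-pattern: metadata concatenated into the embedded text
def build_search_record_mixed(doc_id, content, metadata, embed_fn):
    text_with_metadata = (
        f"department: {metadata.get('department')} | "
        f"type: {metadata.get('doc_type')} | {content}"
    )
    return {"id": doc_id, "contentVector": embed_fn(text_with_metadata)}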
6 Query Rewriting Without Document Alignment
Many RAG systems apply LLM-based query rewriting to improve clarity and recall. This is generally a good idea. However, if queries are rewritten into structured or expanded forms while documents remain embedded in their raw form, the query and document embeddings can again end up in different regions of the vector space.
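One way to keep the two sides aligned is to rewrite the query toward the register of the indexed chunks, for example with a HyDE-style rewrite that turns the question into a short document-like statement before embedding, or to index the documents in the same expanded form the rewriter produces. The sketch below assumes a generic llm_rewrite callable; the prompt and helper are illustrative, not a specific framework's API.

def rewrite_for_retrieval(question: str, llm_rewrite) -> str:
    """Rewrite an interrogative query into a document-style statement before
    embedding (HyDE-style). llm_rewrite is a placeholder for your LLM call."""
    prompt = (
        "Rewrite the question as a short declarative sentence, phrased the way an "
        "internal policy document would state the answer. Do not invent facts; use "
        "placeholders where the value is unknown.\n\n"
        f"Question: {question}"
    )
    return llm_rewrite(prompt)

# Example with a stubbed rewriter, purely for illustration:
stub = lambda _prompt: "Employees are eligible for <N> weeks of maternity leave."
print(rewrite_for_retrieval("What is the maternity leave duration?", stub))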
7 Temporal and Vocabulary Drift
Documents are often static snapshots written months or years ago using older terminology or product names, while queries reflect current language, new acronyms, and rebranded services. Embedding models are semantic, not omniscient. Without synonym expansion or alias mapping, the vectors drift apart over time.
def main():
    """
    Demo: Temporal & Vocabulary Drift
    --------------------------------
    Same meaning, different time + different terms.
    Older documents and newer queries drift apart
    even with the same embedding model.
    """
    # Extremely simplified "embedding" to show the vocabulary effect
    def embed(text):
        return set(text.lower().split())

    old_document = "Power BI Premium Gen2 capacity supports large enterprise analytics workloads."
    new_query = "How does Microsoft Fabric capacity work?"

    doc_embedding = embed(old_document)
    query_embedding = embed(new_query)
    overlap = doc_embedding.intersection(query_embedding)

    print("Document terms:", doc_embedding)
    print("Query terms:   ", query_embedding)
    print("\nVocabulary overlap:", overlap)

    print("\nObservation:")
    print("- Same underlying concept (capacity)")
    print("- Different terminology over time")
    print("- Vector drift causing weak retrieval")

if __name__ == "__main__":
    main()
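A lightweight mitigation is to maintain an alias map for renamed products and acronyms and expand the query (or the documents) with both old and new terms before embedding. The dictionary below is a hand-maintained, illustrative example rather than an exhaustive mapping.

# Hand-maintained alias map; the entries are illustrative examples only.
ALIASES = {
    "microsoft fabric": ["power bi premium", "premium gen2"],
    "fabric capacity": ["premium capacity", "gen2 capacity"],
}

def expand_query(query: str, aliases: dict) -> str:
    """Append older/newer terms so the query shares vocabulary with documents
    written under either naming."""
    expanded = [query]
    lowered = query.lower()
    for term, synonyms in aliases.items():
        if term in lowered:
            expanded.extend(synonyms)
    return " ".join(expanded)

print(expand_query("How does Microsoft Fabric capacity work?", ALIASES))
# The expanded query now overlaps with older "Power BI Premium Gen2" documents.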
8 Modality Asymmetry in Multimodal RAG
Retrieval-Augmented Generation (RAG) was supposed to fix AI hallucination by grounding responses in real documents. For simple text Q&A, it works beautifully. But enterprise knowledge doesn’t live in simple text files. It lives in design docs with architecture diagrams, screenshots, embedded charts, and runbooks mixing instructions with visual references.
An obvious idea is to create a single vector space for all images, text, charts, and so on, and retrieve the best result based on the similarity score with the query. But researchers discovered something called the modality gap: when you force different content types into one embedding space, items cluster by format rather than meaning. Text queries gravitate toward text results, even when an image would answer the question better.
Unless the model is explicitly trained for cross-modal alignment, text and image embeddings will not be comparable. This is why practices such as image captioning followed by text embedding consistently improve retrieval performance.
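In practice this usually means generating a text caption for each image or chart at indexing time and embedding the caption with the same text embedding model that queries use. The sketch below assumes a caption_image helper backed by a vision-capable model and an embed_text helper; both are placeholders, and the field names are illustrative.

def index_image(image_path: str, caption_image, embed_text) -> dict:
    """Caption-then-embed: bring images into the same text embedding space
    that queries live in. caption_image and embed_text are placeholders for
    your vision model and text embedding calls."""
    caption = caption_image(image_path)        # e.g. "Architecture diagram showing ..."
    return {
        "id": image_path,
        "modality": "image",
        "caption": caption,
        "captionVector": embed_text(caption),  # same model/space as the query embeddings
    }

# At query time the question is embedded with that same text model, so text
# chunks and captioned images compete in one comparable space.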
Conclusion
Using the same embedding model does not guarantee effective retrieval. Retrieval quality depends on symmetry across intent, structure, preprocessing, granularity, and modality. When document and query embeddings are not aligned across these dimensions, RAG systems rely on rerankers to compensate, often masking deeper recall issues. Designing for retrieval symmetry is one of the most impactful optimizations in modern RAG pipelines.