Multi-Hop & Agentic Retrieval

Seer supports logging and evaluating multi-step retrieval workflows — from decomposed queries to agentic RAG patterns.

Overview

Many real-world queries can't be answered with a single retrieval. Consider:

"What awards did the director of Inception win?"

This requires:

  1. First, find who directed Inception → Christopher Nolan
  2. Then, find what awards Christopher Nolan won

Seer tracks each hop separately while computing trace-level metrics from the final context.


Key Fields

task — The Original Query

Always pass the original user query in task. This is what Seer evaluates against for end-to-end relevance.
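For example, a hop that only answers a sub-question still logs the full original question (a minimal sketch; retrieve stands in for your own search function):

client.log(
    task="What awards did the director of Inception win?",  # original user query
    context=retrieve("Who directed Inception?"),            # hop-specific evidence
    span_name="retrieval_hop_1",
)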

subquery — The Decomposed Question

The subquery is what this specific retrieval hop is trying to answer. A query rewriter or planner typically generates these.
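For example (a sketch; hop1_context is a placeholder for this hop's retrieved documents):

client.log(
    task="What awards did the director of Inception win?",  # unchanged original query
    context=hop1_context,
    subquery="Who directed Inception?",  # the question this specific hop answers
    span_name="retrieval_hop_1",
)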

is_final_context — Final Evidence for the LLM

Mark the retrieval step whose context is passed to the LLM or agent for final answer synthesis. Seer uses this span for trace-level metrics.
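For example (a sketch; all_context is a placeholder for the combined evidence from every hop):

client.log(
    task="What awards did the director of Inception win?",
    context=all_context,           # everything the LLM will see
    span_name="final_context",
    is_final_context=True,         # trace-level metrics come from this span
)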


Complete Example: Query Decomposition

from seer import SeerClient
from opentelemetry import trace

client = SeerClient()
tracer = trace.get_tracer(__name__)

def answer_multi_hop_question(query: str) -> str:
    """
    Example: "What awards did the director of Inception win?"

    This requires decomposing into:
    1. "Who directed Inception?"
    2. "What awards has [director] won?"
    """
    with tracer.start_as_current_span("multi_hop_query"):

        # Hop 1: Find the director
        # A query rewriter decomposes the original question into its first sub-question
        subquery1 = "Who directed Inception?"

        with tracer.start_as_current_span("retrieval_hop_1"):
            hop1_context = retrieve(subquery1)

            client.log(
                task=query,           # Original: "What awards did..."
                context=hop1_context,
                subquery=subquery1,   # "Who directed Inception?"
                span_name="retrieval_hop_1",
            )

        # Extract the answer: "Christopher Nolan"
        director = extract_entity(hop1_context, "director")

        # Hop 2: Find awards for the director
        # The query rewriter uses the extracted entity to form the next subquery
        subquery2 = f"What awards has {director} won?"

        with tracer.start_as_current_span("retrieval_hop_2"):
            hop2_context = retrieve(subquery2)

            client.log(
                task=query,           # Still the original query
                context=hop2_context,
                subquery=subquery2,   # "What awards has Christopher Nolan won?"
                span_name="retrieval_hop_2",
            )

        # Combine contexts from all hops
        all_context = hop1_context + hop2_context

        # Log the final joined context that goes to the LLM
        with tracer.start_as_current_span("context_join"):
            client.log(
                task=query,
                context=all_context,       # Combined context from all hops
                span_name="final_context",
                is_final_context=True,     # THIS is what the LLM sees
            )

        return synthesize_answer(query, all_context)
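Running this function produces three logged spans: retrieval_hop_1 and retrieval_hop_2, each scored against its own subquery, plus a final_context span whose joined context drives the trace-level metrics.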

What Seer Evaluates

For each hop, Seer computes:

Evaluation        Against          Purpose
Task Recall       Original task    Is this hop contributing to the end goal?
Subquery Recall   subquery         Did this hop answer its specific question?

Example Metrics

Span            Subquery                        Subquery Recall       Task Recall
Hop 1           "Who directed Inception?"       100% (found Nolan)    50% (partial answer)
Hop 2           "What awards has Nolan won?"    100% (found awards)   80% (most of answer)
Final Context   -                               -                     100% (complete)

Trace-level metrics are computed from the is_final_context=True span (the joined context).


Trace-Level vs Span-Level Metrics

Metric Level   Scope                       Use Case
Span-level     Individual retrieval step   Debug which hop failed
Trace-level    Final context               End-to-end quality for the user

Trace-Based Sampling

When you provide a trace_id (auto-detected from OTEL), Seer ensures all spans in the trace get the same sampling decision. You'll never see partial traces.
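For example, logging every hop under one OTEL root span keeps the whole workflow in a single trace, so the hops are kept or dropped together (a sketch reusing the placeholders above):

with tracer.start_as_current_span("multi_hop_query"):
    # Both logs inherit the same trace_id from the active OTEL context,
    # so Seer applies a single sampling decision to the pair
    client.log(task=query, context=hop1_context, span_name="retrieval_hop_1")
    client.log(task=query, context=hop2_context, span_name="retrieval_hop_2")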


Agentic RAG Patterns

For agent loops where the number of retrievals is dynamic:

def agent_loop(query: str, max_iterations: int = 5):
    """Agent decides when to retrieve and when to answer."""
    context_so_far = []

    with tracer.start_as_current_span("agent_query"):
        for i in range(max_iterations):
            # Agent decides the next action
            action = agent.plan_next_action(query, context_so_far)

            if action.type == "retrieve":
                # Agent wants more information
                results = retrieve(action.search_query)
                context_so_far.extend(results)

                client.log(
                    task=query,
                    context=results,
                    subquery=action.search_query,  # What the agent searched for
                    span_name=f"agent_retrieval_{i}",
                    metadata={
                        "iteration": i,
                        "agent_reasoning": action.reasoning,
                    },
                )

            elif action.type == "answer":
                # Agent is ready to synthesize
                # Mark the final context
                client.log(
                    task=query,
                    context=context_so_far,
                    span_name="final_context",
                    is_final_context=True,
                )
                break

    return agent.synthesize(query, context_so_far)
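Note that if the loop exhausts max_iterations without the agent choosing answer, no span is marked is_final_context; if your agent can hit that limit, log the accumulated context as final before calling synthesize.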

More Examples

Parallel Retrieval

When you search multiple sources in parallel:

with tracer.start_as_current_span("parallel_retrieval"):
    # Search multiple sources simultaneously
    wiki_results = retrieve_from_wiki(query)
    kb_results = retrieve_from_kb(query)

    client.log(task=query, context=wiki_results, span_name="retrieval_wiki")
    client.log(task=query, context=kb_results, span_name="retrieval_kb")

    # Combine and pass to the LLM
    combined = wiki_results + kb_results
    client.log(
        task=query,
        context=combined,
        span_name="final_context",
        is_final_context=True,
    )

Iterative Refinement

When you re-retrieve based on LLM feedback:

# Initial retrieval
initial_context = retrieve(query)
client.log(task=query, context=initial_context, span_name="retrieval_initial")

# LLM suggests refinement
refined_query = llm.suggest_refined_query(query, initial_context)

# Refined retrieval
refined_context = retrieve(refined_query)
client.log(
    task=query,
    context=refined_context,
    subquery=refined_query,
    span_name="retrieval_refined",
    is_final_context=True,
)
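Here only the refined context is marked as final. If your synthesizer sees both the initial and the refined results, combine them and set is_final_context=True on the combined log instead, as in the parallel example above.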

Best Practices

1. Always Set is_final_context for the Last Hop

This enables trace-level metrics that reflect end-user experience:

# The context that actually goes to the LLM
client.log(..., is_final_context=True)

2. Keep task Consistent Across Hops

The original query should stay the same — that's what you're ultimately trying to answer:

# ✓ Correct: Same task, different subqueries
client.log(task=original_query, subquery="Who is X?", ...)
client.log(task=original_query, subquery="What did X do?", ...)

# ✗ Wrong: Changing task per hop
client.log(task="Who is X?", ...) # Don't do this

3. Use Subqueries for Decomposition

Subqueries help diagnose which step failed:

# If task recall is low but subquery recall is high,
# the problem is query decomposition, not retrieval

4. Use Consistent Span Names

Pattern            Span Name
Sequential hops    retrieval_hop_1, retrieval_hop_2
Parallel sources   retrieval_wiki, retrieval_kb
Agent iterations   agent_retrieval_0, agent_retrieval_1
Final merged       final_context

See Also