
On this page

  • Introduction
  • Data Collection
  • Save web pages to local data
  • Retrieval
  • Chat with RAG

Retrieval-Augmented Generation (RAG) in R & Python

Categories: AI, API, tutorial

Author: Tony D

Published: November 2, 2025

Introduction

Retrieval-Augmented Generation (RAG) is a powerful technique that combines the generative capabilities of Large Language Models (LLMs) with the precision of information retrieval. By grounding LLM responses in external, verifiable data, RAG reduces hallucinations and enables the model to answer questions about specific, private, or up-to-date information.

In this tutorial, we will build a RAG system using both R and Python.

In R, we’ll leverage the ragnar package for RAG workflows and ellmer for chat interfaces.

In Python, we’ll use LangChain for the RAG pipeline (with a parallel LlamaIndex + DuckDB variant), ChromaDB for the vector store, and the OpenAI SDK pointed at OpenRouter for model interaction.

Our goal is to create a system that can answer questions about the OpenRouter API by scraping its documentation.
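
Conceptually, every pipeline below follows the same three-step loop: embed the question, retrieve the most similar stored chunks, and generate an answer grounded in those chunks. Here is a minimal sketch of that loop in Python, using placeholder functions (embed, search, and generate are illustrative names, not part of any specific library); the concrete implementations come from ragnar and ellmer in R and from LangChain / LlamaIndex in Python.

Code
from typing import Callable, List

def rag_answer(
    question: str,
    embed: Callable[[str], List[float]],              # text -> embedding vector
    search: Callable[[List[float], int], List[str]],  # (vector, k) -> top-k text chunks
    generate: Callable[[str], str],                   # prompt -> LLM completion
    k: int = 3,
) -> str:
    """Answer a question by grounding the LLM in retrieved context."""
    query_vector = embed(question)        # 1. embed the user question
    chunks = search(query_vector, k)      # 2. retrieve the k most similar chunks
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)               # 3. generate a grounded answer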

Data Collection

  • R
  • Python

First, we need to gather the data for our knowledge base. We’ll use the rvest package to scrape URLs from the OpenRouter documentation. This will give us a list of pages to ingest.

Code
library(ragnar)
library(ellmer)
library(dotenv)
load_dot_env(file = ".env")
Code
library(rvest)

# URL to scrape
url <- "https://openrouter.ai/docs/quickstart"

# Read the HTML content of the page
page <- read_html(url)

# Extract all <a> tags with href
links <- page %>%
    html_nodes("a") %>%
    html_attr("href")

# Remove NAs and duplicates
links <- unique(na.omit(links))

# Optional: keep only full URLs
links_full <- paste0("https://openrouter.ai", links[grepl("^/docs/", links)])

# Print all links
print(links_full)
 [1] "https://openrouter.ai/docs/api-reference/overview"                               
 [2] "https://openrouter.ai/docs/quickstart"                                           
 [3] "https://openrouter.ai/docs/api/reference/overview"                               
 [4] "https://openrouter.ai/docs/sdks/call-model/overview"                             
 [5] "https://openrouter.ai/docs/guides/overview/principles"                           
 [6] "https://openrouter.ai/docs/guides/overview/models"                               
 [7] "https://openrouter.ai/docs/faq"                                                  
 [8] "https://openrouter.ai/docs/guides/overview/report-feedback"                      
 [9] "https://openrouter.ai/docs/guides/routing/model-fallbacks"                       
[10] "https://openrouter.ai/docs/guides/routing/provider-selection"                    
[11] "https://openrouter.ai/docs/guides/features/presets"                              
[12] "https://openrouter.ai/docs/guides/features/tool-calling"                         
[13] "https://openrouter.ai/docs/guides/features/structured-outputs"                   
[14] "https://openrouter.ai/docs/guides/features/message-transforms"                   
[15] "https://openrouter.ai/docs/guides/features/zero-completion-insurance"            
[16] "https://openrouter.ai/docs/guides/features/zdr"                                  
[17] "https://openrouter.ai/docs/app-attribution"                                      
[18] "https://openrouter.ai/docs/faq#how-are-rate-limits-calculated"                   
[19] "https://openrouter.ai/docs/api/reference/streaming"                              
[20] "https://openrouter.ai/docs/guides/community/frameworks-and-integrations-overview"

First, we need to gather the data for our knowledge base. We’ll use requests and BeautifulSoup to scrape URLs from the OpenRouter documentation. This will give us a list of pages to ingest.

Code
import sys
print(sys.executable)
/Library/Frameworks/Python.framework/Versions/3.13/bin/python3
Code
import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
import os
from markitdown import MarkItDown
from io import BytesIO
import re

# Load environment variables
load_dotenv()
True
Code
# Helper functions
def fetch_html(url: str) -> bytes:
    """Fetch HTML content from URL and return as bytes."""
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.content

def html_to_markdown(html_bytes: bytes) -> str:
    """Convert HTML bytes to markdown using MarkItDown."""
    md = MarkItDown()
    stream = BytesIO(html_bytes)
    result = md.convert_stream(stream, mime_type="text/html")
    return result.markdown

def save_markdown(md_content: str, output_path: str):
    """Save markdown content to file."""
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(md_content)

def sanitize_filename(filename: str) -> str:
    """Sanitize URL to create a valid filename."""
    filename = re.sub(r'^https?://[^/]+', '', filename)
    filename = re.sub(r'[^\w\-_.]', '_', filename)
    filename = filename.strip('_')
    if not filename.endswith('.md'):
        filename += '.md'
    return filename

# URL to scrape
url = "https://openrouter.ai/docs/quickstart"

# Read the HTML content of the page
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Extract all <a> tags with href
links = [a['href'] for a in soup.find_all('a', href=True)]

# Remove duplicates
links = list(set(links))

# Keep only full URLs for docs
links_full = [f"https://openrouter.ai{link}" for link in links if link.startswith("/docs/")]

# Explicitly add FAQ
links_full.append("https://openrouter.ai/docs/faq")
links_full = list(set(links_full))

# Print all links
print(f"Found {len(links_full)} documentation URLs")
Found 20 documentation URLs
Code
print(links_full)
['https://openrouter.ai/docs/guides/overview/report-feedback', 'https://openrouter.ai/docs/guides/features/message-transforms', 'https://openrouter.ai/docs/api-reference/overview', 'https://openrouter.ai/docs/guides/features/structured-outputs', 'https://openrouter.ai/docs/guides/routing/provider-selection', 'https://openrouter.ai/docs/api/reference/overview', 'https://openrouter.ai/docs/guides/features/tool-calling', 'https://openrouter.ai/docs/guides/overview/principles', 'https://openrouter.ai/docs/quickstart', 'https://openrouter.ai/docs/sdks/call-model/overview', 'https://openrouter.ai/docs/faq#how-are-rate-limits-calculated', 'https://openrouter.ai/docs/guides/community/frameworks-and-integrations-overview', 'https://openrouter.ai/docs/faq', 'https://openrouter.ai/docs/guides/routing/model-fallbacks', 'https://openrouter.ai/docs/guides/overview/models', 'https://openrouter.ai/docs/guides/features/zero-completion-insurance', 'https://openrouter.ai/docs/guides/features/presets', 'https://openrouter.ai/docs/api/reference/streaming', 'https://openrouter.ai/docs/guides/features/zdr', 'https://openrouter.ai/docs/app-attribution']

Save web pages to local data

  • R
  • Python DuckDB
  • Python Chroma

To perform semantic search, we need to store our text data as vectors (embeddings). We’ll use DuckDB as our local vector store. We also need an embedding model to convert text into vectors. Here, we configure ragnar to use a specific embedding model via an OpenAI-compatible API (SiliconFlow).

Code
# pages <- ragnar_find_links(base_url)
pages <- links_full
store_location <- "openrouter.duckdb"

store <- ragnar_store_create(
    store_location,
    overwrite = TRUE,
    embed = \(x) ragnar::embed_openai(x,
        model = "BAAI/bge-m3",
        base_url = "https://api.siliconflow.cn/v1",
        api_key = Sys.getenv("siliconflow")
    )
)

With our store initialized, we can now ingest the data. We iterate through the list of pages we scraped earlier. For each page, we:

  1. Read the content as markdown.
  2. Split the content into smaller chunks (approx. 2000 characters).
  3. Insert these chunks into our vector store.

This process builds the index that we’ll search against.

Code
# page="https://openrouter.ai/docs/faq"
# chunks <- page |>read_as_markdown() |>markdown_chunk(target_size = 2000)
# ragnar_chunks_view(chunks)
Code
for (page in pages) {
    message("ingesting: ", page)
    print(page)
    chunks <- page |>
        read_as_markdown() |>
        markdown_chunk(target_size = 2000)
    # print(chunks)
    # print('chunks done')
    ragnar_store_insert(store, chunks)
    print("insrt done")
}
Code
ragnar_store_build_index(store)
Code
import os
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext, Settings
from llama_index.vector_stores.duckdb import DuckDBVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding

# --- 1. Configuration ---

# Ensure your API key is available
openrouter_api_key = os.getenv("OPENROUTER_API_KEY") # or paste string directly

# Initialize the embedding model pointing to OpenRouter
# We use the OpenAI class because OpenRouter uses an OpenAI-compatible API structure
embed_model = OpenAIEmbedding(
    api_key=openrouter_api_key,
    api_base="https://openrouter.ai/api/v1",  # OpenAIEmbedding takes `api_base`, not `base_url`
    model="qwen/qwen3-embedding-8b"
)

# Update the global settings so LlamaIndex knows to use this model
Settings.embed_model = embed_model
Settings.chunk_size = 2000
Settings.chunk_overlap = 200
# --- 2. Ingestion and Indexing ---

# Load data
documents = SimpleDirectoryReader("markdown_docs").load_data()

# Initialize DuckDB Vector Store
vector_store = DuckDBVectorStore("openrouter.duckdb", persist_dir="./persist/")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create the index
# This will now automatically use the Qwen embeddings defined in Settings
index = VectorStoreIndex.from_documents(
    documents, 
    storage_context=storage_context
)

To perform semantic search, we need to store our text data as vectors (embeddings). We’ll use ChromaDB as our local vector store. We also need an embedding model to convert text into vectors. Here, we configure a custom OpenRouterEmbeddings class to use the qwen/qwen3-embedding-8b model via the OpenRouter API.

Code
from openai import OpenAI
from langchain_core.embeddings import Embeddings
from langchain_chroma import Chroma
from typing import List
import os
from dotenv import load_dotenv

load_dotenv()

# Custom embeddings class for OpenRouter API
class OpenRouterEmbeddings(Embeddings):
    """Custom embeddings class for OpenRouter API."""
    
    def __init__(self, api_key: str, model: str = "text-embedding-3-small"):
        self.client = OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=api_key,
        )
        self.model = model
    
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed a list of documents."""
        response = self.client.embeddings.create(
            extra_headers={
                "HTTP-Referer": "https://ai-blog.com",
                "X-Title": "AI Blog RAG",
            },
            model=self.model,
            input=texts,
            encoding_format="float"
        )
        return [item.embedding for item in response.data]
    
    def embed_query(self, text: str) -> List[float]:
        """Embed a single query."""
        response = self.client.embeddings.create(
            extra_headers={
                "HTTP-Referer": "https://ai-blog.com",
                "X-Title": "AI Blog RAG",
            },
            model=self.model,
            input=text,
            encoding_format="float"
        )
        return response.data[0].embedding

# Get OpenRouter API key
openrouter_api_key = os.getenv("OPENROUTER_API_KEY")
if not openrouter_api_key:
    raise ValueError("OPENROUTER_API_KEY not found in environment variables")

# Create embeddings instance using OpenRouter
embeddings = OpenRouterEmbeddings(
    api_key=openrouter_api_key,
    model="qwen/qwen3-embedding-8b"
)

# Define vector store location
persist_directory = "chroma_db_data"

With our store initialized, we can now ingest the data. We iterate through the markdown files we saved earlier. For each file, we:

  1. Load the content.
  2. Split the content into smaller chunks (approx. 2000 characters) using RecursiveCharacterTextSplitter.
  3. Create a new Chroma vector store from these chunks.

This process builds the index that we’ll search against.

Code
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
import shutil

# Helper function to load markdown files
def load_markdown_files(directory: str) -> list[Document]:
    """Load all markdown files from directory and create Document objects."""
    documents = []
    if not os.path.exists(directory):
        return documents
        
    for filename in os.listdir(directory):
        if filename.endswith('.md'):
            filepath = os.path.join(directory, filename)
            with open(filepath, 'r', encoding='utf-8') as f:
                content = f.read()
                doc = Document(
                    page_content=content,
                    metadata={
                        "source": filename,
                        "filepath": filepath
                    }
                )
                documents.append(doc)
    return documents

# Create output directory for markdown files
output_dir = "markdown_docs"
os.makedirs(output_dir, exist_ok=True)

# Convert each URL to markdown and save
for i, link_url in enumerate(links_full, 1):
    try:
        print(f"Processing {i}/{len(links_full)}: {link_url}")
        html_content = fetch_html(link_url)
        markdown_content = html_to_markdown(html_content)
        filename = sanitize_filename(link_url)
        output_path = os.path.join(output_dir, filename)
        save_markdown(markdown_content, output_path)
        print(f"  ✓ Saved to {output_path}")
    except Exception as e:
        print(f"  ✗ Error processing {link_url}: {str(e)}")

# Load markdown documents
documents = load_markdown_files(output_dir)
print(f"\nLoaded {len(documents)} markdown documents")

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

splits = text_splitter.split_documents(documents)
print(f"Split into {len(splits)} chunks")
Code
# Remove existing database if it exists
if os.path.exists(persist_directory):
    print(f"Removing existing database at {persist_directory}...")
    shutil.rmtree(persist_directory)

# Create new vector store
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory=persist_directory
)

print(f"\n✓ Successfully created ChromaDB with {len(splits)} chunks!")
print(f"✓ Database saved to: {persist_directory}")

Retrieval

  • R
  • Python DuckDB
  • Python Chroma

Now that our knowledge base is populated, we can test the retrieval system. We can ask a specific question, like “What are model variants?”, and query the store to see which text chunks are most relevant. This confirms that our embeddings and search are working correctly.

Question: What are model variants?

RAG result:

Code
store_location <- "openrouter.duckdb"
store <- ragnar_store_connect(store_location)

text <- "What are model variants?"

#' # Retrieving Chunks
#' Once the store is set up, retrieve the most relevant text chunks like this

relevant_chunks <- ragnar_retrieve(store, text, top_k = 3)
cat("Retrieved", nrow(relevant_chunks), "chunks:\n\n")
Retrieved 6 chunks:
Code
for (i in seq_len(nrow(relevant_chunks))) {
    cat(sprintf("--- Chunk %d ---\n%s\n\n", i, relevant_chunks$text[i]))
}
--- Chunk 1 ---
[Zero Completion Insurance](/docs/features/zero-completion-insurance)
  + [Provisioning API Keys](/docs/features/provisioning-api-keys)
  + [App Attribution](/docs/app-attribution)
* API Reference

  + [Overview](/docs/api-reference/overview)
  + [Streaming](/docs/api-reference/streaming)
  + [Embeddings](/docs/api-reference/embeddings)
  + [Limits](/docs/api-reference/limits)
  + [Authentication](/docs/api-reference/authentication)
  + [Parameters](/docs/api-reference/parameters)
  + [Errors](/docs/api-reference/errors)
  + Responses API
  + beta.responses
  + Analytics
  + Credits
  + Embeddings
  + Generations
  + Models
  + Endpoints
  + Parameters
  + Providers
  + API Keys
  + O Auth
  + Chat
  + Completions
* SDK Reference (BETA)

  + [Python SDK](/docs/sdks/python)
  + [TypeScript SDK](/docs/sdks/typescript)
* Use Cases

  + [BYOK](/docs/use-cases/byok)
  + [Crypto API](/docs/use-cases/crypto-api)
  + [OAuth PKCE](/docs/use-cases/oauth-pkce)
  + [MCP Servers](/docs/use-cases/mcp-servers)
  + [Organization Management](/docs/use-cases/organization-management)
  + [For Providers](/docs/use-cases/for-providers)
  + [Reasoning Tokens](/docs/use-cases/reasoning-tokens)
  + [Usage Accounting](/docs/use-cases/usage-accounting)
  + [User Tracking](/docs/use-cases/user-tracking)
* Community

  + [Frameworks and Integrations Overview](/docs/community/frameworks-and-integrations-overview)
  + [Effect AI SDK](/docs/community/effect-ai-sdk)
  + [Arize](/docs/community/arize)
  + [LangChain](/docs/community/lang-chain)
  + [LiveKit](/docs/community/live-kit)
  + [Langfuse](/docs/community/langfuse)
  + [Mastra](/docs/community/mastra)
  + [OpenAI SDK](/docs/community/open-ai-sdk)
  + [PydanticAI](/docs/community/pydantic-ai)
  + [Vercel AI SDK](/docs/community/vercel-ai-sdk)
  + [Xcode](/docs/community/xcode)
  + [Zapier](/docs/community/zapier)
  + [Discord](https://discord.gg/openrouter)

Light

On this page

* [Requests](#requests)
* [Completions Request Format](#completions-request-format)
* [Headers](#headers)
* [Assistant Prefill](#assistant-prefill)
* [Responses](#responses)
* [CompletionsResponse Format](#completionsresponse-format)
* [Finish Reason](#finish-reason)
* [Querying Cost and Stats](#querying-cost-and-stats)

[API Reference](/docs/api-reference/overview)



--- Chunk 2 ---
###### How frequently are new models added?

We work on adding models as quickly as we can. We often have partnerships with
the labs releasing models and can release models as soon as they are
available. If there is a model missing that you’d like OpenRouter to support, feel free to message us on
[Discord](https://discord.gg/openrouter).

###### What are model variants?

Variants are suffixes that can be added to the model slug to change its behavior.

Static variants can only be used with specific models and these are listed in our [models api](https://openrouter.ai/api/v1/models).

1. `:free` - The model is always provided for free and has low rate limits.
2. `:beta` - The model is not moderated by OpenRouter.
3. `:extended` - The model has longer than usual context length.
4. `:exacto` - The model only uses OpenRouter-curated high-quality endpoints.
5. `:thinking` - The model supports reasoning by default.

Dynamic variants can be used on all models and they change the behavior of how the request is routed or used.

1. `:online` - All requests will run a query to extract web results that are attached to the prompt.
2. `:nitro` - Providers will be sorted by throughput rather than the default sort, optimizing for faster response times.
3. `:floor` - Providers will be sorted by price rather than the default sort, prioritizing the most cost-effective options.

###### I am an inference provider, how can I get listed on OpenRouter?

You can read our requirements at the [Providers
page](/docs/use-cases/for-providers). If you would like to contact us, the best
place to reach us is over email.

###### What is the expected latency/response time for different models?

For each model on OpenRouter we show the latency (time to first token) and the token
throughput for all providers. You can use this to estimate how long requests
will take. If you would like to optimize for throughput you can use the
`:nitro` variant to route to the fastest provider.



--- Chunk 3 ---
###### How frequently are new models added?

We work on adding models as quickly as we can. We often have partnerships with
the labs releasing models and can release models as soon as they are
available. If there is a model missing that you’d like OpenRouter to support, feel free to message us on
[Discord](https://discord.gg/openrouter).

###### What are model variants?

Variants are suffixes that can be added to the model slug to change its behavior.

Static variants can only be used with specific models and these are listed in our [models api](https://openrouter.ai/api/v1/models).

1. `:free` - The model is always provided for free and has low rate limits.
2. `:beta` - The model is not moderated by OpenRouter.
3. `:extended` - The model has longer than usual context length.
4. `:exacto` - The model only uses OpenRouter-curated high-quality endpoints.
5. `:thinking` - The model supports reasoning by default.

Dynamic variants can be used on all models and they change the behavior of how the request is routed or used.

1. `:online` - All requests will run a query to extract web results that are attached to the prompt.
2. `:nitro` - Providers will be sorted by throughput rather than the default sort, optimizing for faster response times.
3. `:floor` - Providers will be sorted by price rather than the default sort, prioritizing the most cost-effective options.

###### I am an inference provider, how can I get listed on OpenRouter?

You can read our requirements at the [Providers
page](/docs/use-cases/for-providers). If you would like to contact us, the best
place to reach us is over email.

###### What is the expected latency/response time for different models?

For each model on OpenRouter we show the latency (time to first token) and the token
throughput for all providers. You can use this to estimate how long requests
will take. If you would like to optimize for throughput you can use the
`:nitro` variant to route to the fastest provider.



--- Chunk 4 ---
## The `models` parameter

The `models` parameter lets you automatically try other models if the primary model’s providers are down, rate-limited, or refuse to reply due to content moderation.

TypeScript SDKTypeScript (fetch)Python

```code-block-root not-prose rounded-b-[inherit] rounded-t-none
|  |  |
| --- | --- |
| 1 | import { OpenRouter } from '@openrouter/sdk'; |
| 2 |  |
| 3 | const openRouter = new OpenRouter({ |
| 4 | apiKey: '<OPENROUTER_API_KEY>', |
| 5 | }); |
| 6 |  |
| 7 | const completion = await openRouter.chat.send({ |
| 8 | models: ['anthropic/claude-3.5-sonnet', 'gryphe/mythomax-l2-13b'], |
| 9 | messages: [ |
| 10 | { |
| 11 | role: 'user', |
| 12 | content: 'What is the meaning of life?', |
| 13 | }, |
| 14 | ], |
| 15 | }); |
| 16 |  |
| 17 | console.log(completion.choices[0].message.content); |
```

If the model you selected returns an error, OpenRouter will try to use the fallback model instead. If the fallback model is down or returns an error, OpenRouter will return that error.

By default, any error can trigger the use of a fallback model, including context length validation errors, moderation flags for filtered models, rate-limiting, and downtime.

Requests are priced using the model that was ultimately used, which will be returned in the `model` attribute of the response body.

## Using with OpenAI SDK

To use the `models` array with the OpenAI SDK, include it in the `extra_body` parameter. In the example below, gpt-4o will be tried first, and the `models` array will be tried in order as fallbacks.

PythonTypeScript



--- Chunk 5 ---
[Web Search](/docs/features/web-search)
  + [Zero Completion Insurance](/docs/features/zero-completion-insurance)
  + [Provisioning API Keys](/docs/features/provisioning-api-keys)
  + [App Attribution](/docs/app-attribution)
* API Reference

  + [Overview](/docs/api-reference/overview)
  + [Streaming](/docs/api-reference/streaming)
  + [Embeddings](/docs/api-reference/embeddings)
  + [Limits](/docs/api-reference/limits)
  + [Authentication](/docs/api-reference/authentication)
  + [Parameters](/docs/api-reference/parameters)
  + [Errors](/docs/api-reference/errors)
  + Responses API
  + beta.responses
  + Analytics
  + Credits
  + Embeddings
  + Generations
  + Models
  + Endpoints
  + Parameters
  + Providers
  + API Keys
  + O Auth
  + Chat
  + Completions
* SDK Reference (BETA)

  + [Python SDK](/docs/sdks/python)
  + [TypeScript SDK](/docs/sdks/typescript)
* Use Cases

  + [BYOK](/docs/use-cases/byok)
  + [Crypto API](/docs/use-cases/crypto-api)
  + [OAuth PKCE](/docs/use-cases/oauth-pkce)
  + [MCP Servers](/docs/use-cases/mcp-servers)
  + [Organization Management](/docs/use-cases/organization-management)
  + [For Providers](/docs/use-cases/for-providers)
  + [Reasoning Tokens](/docs/use-cases/reasoning-tokens)
  + [Usage Accounting](/docs/use-cases/usage-accounting)
  + [User Tracking](/docs/use-cases/user-tracking)
* Community

  + [Frameworks and Integrations Overview](/docs/community/frameworks-and-integrations-overview)
  + [Effect AI SDK](/docs/community/effect-ai-sdk)
  + [Arize](/docs/community/arize)
  + [LangChain](/docs/community/lang-chain)
  + [LiveKit](/docs/community/live-kit)
  + [Langfuse](/docs/community/langfuse)
  + [Mastra](/docs/community/mastra)
  + [OpenAI SDK](/docs/community/open-ai-sdk)
  + [PydanticAI](/docs/community/pydantic-ai)
  + [Vercel AI SDK](/docs/community/vercel-ai-sdk)
  + [Xcode](/docs/community/xcode)
  + [Zapier](/docs/community/zapier)
  + [Discord](https://discord.gg/openrouter)

Light

On this page

* [Within OpenRouter](#within-openrouter)
* [Provider Policies](#provider-policies)
* [Training on Prompts](#training-on-prompts)
* [Data Retention & Logging](#data-retention--logging)
* [Enterprise EU in-region routing](#enterprise-eu-in-region-routing)

[Features](/docs/features/privacy-and-logging)



--- Chunk 6 ---
[Web Search](/docs/features/web-search)
  + [Zero Completion Insurance](/docs/features/zero-completion-insurance)
  + [Provisioning API Keys](/docs/features/provisioning-api-keys)
  + [App Attribution](/docs/app-attribution)
* API Reference

  + [Overview](/docs/api-reference/overview)
  + [Streaming](/docs/api-reference/streaming)
  + [Embeddings](/docs/api-reference/embeddings)
  + [Limits](/docs/api-reference/limits)
  + [Authentication](/docs/api-reference/authentication)
  + [Parameters](/docs/api-reference/parameters)
  + [Errors](/docs/api-reference/errors)
  + Responses API
  + beta.responses
  + Analytics
  + Credits
  + Embeddings
  + Generations
  + Models
  + Endpoints
  + Parameters
  + Providers
  + API Keys
  + O Auth
  + Chat
  + Completions
* SDK Reference (BETA)

  + [Python SDK](/docs/sdks/python)
  + [TypeScript SDK](/docs/sdks/typescript)
* Use Cases

  + [BYOK](/docs/use-cases/byok)
  + [Crypto API](/docs/use-cases/crypto-api)
  + [OAuth PKCE](/docs/use-cases/oauth-pkce)
  + [MCP Servers](/docs/use-cases/mcp-servers)
  + [Organization Management](/docs/use-cases/organization-management)
  + [For Providers](/docs/use-cases/for-providers)
  + [Reasoning Tokens](/docs/use-cases/reasoning-tokens)
  + [Usage Accounting](/docs/use-cases/usage-accounting)
  + [User Tracking](/docs/use-cases/user-tracking)
* Community

  + [Frameworks and Integrations Overview](/docs/community/frameworks-and-integrations-overview)
  + [Effect AI SDK](/docs/community/effect-ai-sdk)
  + [Arize](/docs/community/arize)
  + [LangChain](/docs/community/lang-chain)
  + [LiveKit](/docs/community/live-kit)
  + [Langfuse](/docs/community/langfuse)
  + [Mastra](/docs/community/mastra)
  + [OpenAI SDK](/docs/community/open-ai-sdk)
  + [PydanticAI](/docs/community/pydantic-ai)
  + [Vercel AI SDK](/docs/community/vercel-ai-sdk)
  + [Xcode](/docs/community/xcode)
  + [Zapier](/docs/community/zapier)
  + [Discord](https://discord.gg/openrouter)

Light

On this page

* [How OpenRouter Manages Data Policies](#how-openrouter-manages-data-policies)
* [Per-Request ZDR Enforcement](#per-request-zdr-enforcement)
* [Usage](#usage)
* [Caching](#caching)
* [OpenRouter’s Retention Policy](#openrouters-retention-policy)
* [Zero Retention Endpoints](#zero-retention-endpoints)

[Features](/docs/features/privacy-and-logging)
Code
# ragnar_store_inspect(store)
#ragnar_chunks_view(chunks)

In Python, we can use LlamaIndex to interact with our DuckDB vector store. In this step, we’ll configure the embedding model and retrieve the top relevant chunks for our query (a commented-out snippet shows how to save them to a markdown file for inspection). We won’t use an LLM for generation yet, focusing solely on verifying the retrieval quality.

Question: What are model variants?

RAG result:

Code
import os
import sys
print(f"Python executable: {sys.executable}")
Python executable: /Library/Frameworks/Python.framework/Versions/3.13/bin/python3
Code
print(f"Python path: {sys.path}")
Python path: ['', '/Library/Frameworks/Python.framework/Versions/3.13/bin', '/Library/Frameworks/Python.framework/Versions/3.13/lib/python313.zip', '/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13', '/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/lib-dynload', '/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/site-packages', '/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/reticulate/python']
Code
from typing import Any, List
from openai import OpenAI
from llama_index.core import VectorStoreIndex, Settings
from llama_index.core.embeddings import BaseEmbedding
from llama_index.vector_stores.duckdb import DuckDBVectorStore
from dotenv import load_dotenv

# Load environment variables
load_dotenv()
True
Code
# Ensure your API key is available
openrouter_api_key = os.getenv("OPENROUTER_API_KEY")

# Custom OpenRouter Embedding class for LlamaIndex
class OpenRouterEmbedding(BaseEmbedding):
    """Custom embedding class for OpenRouter API compatible with LlamaIndex."""
    
    def __init__(
        self,
        api_key: str,
        model: str = "qwen/qwen3-embedding-8b",
        **kwargs: Any
    ):
        super().__init__(**kwargs)
        self._client = OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=api_key,
        )
        self._model = model
    
    def _get_query_embedding(self, query: str) -> List[float]:
        """Get embedding for a query string."""
        response = self._client.embeddings.create(
            extra_headers={
                "HTTP-Referer": "https://ai-blog.com",
                "X-Title": "AI Blog RAG",
            },
            model=self._model,
            input=query,
            encoding_format="float"
        )
        return response.data[0].embedding
    
    def _get_text_embedding(self, text: str) -> List[float]:
        """Get embedding for a text string."""
        return self._get_query_embedding(text)
    
    async def _aget_query_embedding(self, query: str) -> List[float]:
        """Async version of get_query_embedding."""
        return self._get_query_embedding(query)
    
    async def _aget_text_embedding(self, text: str) -> List[float]:
        """Async version of get_text_embedding."""
        return self._get_text_embedding(text)

# 1. Configure Embedding Model using custom OpenRouter class
embed_model = OpenRouterEmbedding(
    api_key=openrouter_api_key,
    model="qwen/qwen3-embedding-8b"
)

# 2. Apply Settings
Settings.embed_model = embed_model

# 3. Load and Retrieve
# Load the existing DuckDB vector store
print("Loading vector store from openrouter.duckdb...")
Loading vector store from openrouter.duckdb...
Code
vector_store = DuckDBVectorStore(database_name="openrouter.duckdb",persist_dir="./persist/")

index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

# Define query
query = "What are model variants?"
print(f"\n{'='*60}")

============================================================
Code
print(f"Query: '{query}'")
Query: 'What are model variants?'
Code
print(f"{'='*60}\n")
============================================================
Code
# Retrieve top 3 chunks
retriever = index.as_retriever(similarity_top_k=5)
nodes = retriever.retrieve(query)

# Print detailed retrieval info
print(f"Retrieved {len(nodes)} chunks from DuckDB:\n")
Retrieved 5 chunks from DuckDB:
Code
for i, node in enumerate(nodes, 1):
    print(f"{'─'*60}")
    print(f"Chunk {i}")
    print(f"{'─'*60}")

    # Print similarity score
    if hasattr(node, 'score'):
        print(f"Similarity Score: {node.score:.4f}")

    # Print metadata
    if hasattr(node, 'metadata') and node.metadata:
        print(f"Metadata:")
        for key, value in node.metadata.items():
            print(f"  - {key}: {value}")

    # Print text content (truncated for display)
    text_preview = node.text[:500] + "..." if len(node.text) > 500 else node.text
    print(f"\nContent:\n{text_preview}\n")
────────────────────────────────────────────────────────────
Chunk 1
────────────────────────────────────────────────────────────
Similarity Score: 0.6169
Metadata:
  - file_path: /Users/jinchaoduan/Documents/post_project/AI_Blog/posts/RAG/markdown_docs/docs_features_exacto-variant.md
  - file_name: docs_features_exacto-variant.md
  - file_type: text/markdown
  - file_size: 7972
  - creation_date: 2025-11-21
  - last_modified_date: 2025-11-21

Content:
Search

`/`

Ask AI

[API](/docs/api-reference/overview)[Models](https://openrouter.ai/models)[Chat](https://openrouter.ai/chat)[Ranking](https://openrouter.ai/rankings)

* Overview

  + [Quickstart](/docs/quickstart)
  + [FAQ](/docs/faq)
  + [Principles](/docs/overview/principles)
  + [Models](/docs/overview/models)
  + [Enterprise](https://openrouter.ai/enterprise)
* Features

  + [Privacy and Logging](/docs/features/privacy-and-logging)
  + [Zero Data Retention (ZDR)](/docs/features/zdr)
  + ...

────────────────────────────────────────────────────────────
Chunk 2
────────────────────────────────────────────────────────────
Similarity Score: 0.6099
Metadata:
  - file_path: /Users/jinchaoduan/Documents/post_project/AI_Blog/posts/RAG/markdown_docs/docs_overview_models.md
  - file_name: docs_overview_models.md
  - file_type: text/markdown
  - file_size: 9021
  - creation_date: 2025-11-21
  - last_modified_date: 2025-11-21

Content:
Search

`/`

Ask AI

[API](/docs/api-reference/overview)[Models](https://openrouter.ai/models)[Chat](https://openrouter.ai/chat)[Ranking](https://openrouter.ai/rankings)

* Overview

  + [Quickstart](/docs/quickstart)
  + [FAQ](/docs/faq)
  + [Principles](/docs/overview/principles)
  + [Models](/docs/overview/models)
  + [Enterprise](https://openrouter.ai/enterprise)
* Features

  + [Privacy and Logging](/docs/features/privacy-and-logging)
  + [Zero Data Retention (ZDR)](/docs/features/zdr)
  + ...

────────────────────────────────────────────────────────────
Chunk 3
────────────────────────────────────────────────────────────
Similarity Score: 0.5820
Metadata:
  - file_path: /Users/jinchaoduan/Documents/post_project/AI_Blog/posts/RAG/markdown_docs/docs_guides_overview_models.md
  - file_name: docs_guides_overview_models.md
  - file_type: text/markdown
  - file_size: 7557
  - creation_date: 2026-01-06
  - last_modified_date: 2026-01-06

Content:
Search

`/`

Ask AI

[Models](https://openrouter.ai/models)[Chat](https://openrouter.ai/chat)[Ranking](https://openrouter.ai/rankings)[Docs](/docs/api-reference/overview)

[Docs](/docs/quickstart)[API Reference](/docs/api/reference/overview)[SDK Reference](/docs/sdks/call-model/overview)

[Docs](/docs/quickstart)[API Reference](/docs/api/reference/overview)[SDK Reference](/docs/sdks/call-model/overview)

* Overview

  + [Quickstart](/docs/quickstart)
  + [Principles](/docs/guides/overview/princi...

────────────────────────────────────────────────────────────
Chunk 4
────────────────────────────────────────────────────────────
Similarity Score: 0.5763
Metadata:
  - file_path: /Users/jinchaoduan/Documents/post_project/AI_Blog/posts/RAG/markdown_docs/docs_faq_how-are-rate-limits-calculated.md
  - file_name: docs_faq_how-are-rate-limits-calculated.md
  - file_type: text/markdown
  - file_size: 17710
  - creation_date: 2026-01-06
  - last_modified_date: 2026-01-06

Content:
Search

`/`

Ask AI

[Models](https://openrouter.ai/models)[Chat](https://openrouter.ai/chat)[Ranking](https://openrouter.ai/rankings)[Docs](/docs/api-reference/overview)

[Docs](/docs/quickstart)[API Reference](/docs/api/reference/overview)[SDK Reference](/docs/sdks/call-model/overview)

[Docs](/docs/quickstart)[API Reference](/docs/api/reference/overview)[SDK Reference](/docs/sdks/call-model/overview)

* Overview

  + [Quickstart](/docs/quickstart)
  + [Principles](/docs/guides/overview/princi...

────────────────────────────────────────────────────────────
Chunk 5
────────────────────────────────────────────────────────────
Similarity Score: 0.5701
Metadata:
  - file_path: /Users/jinchaoduan/Documents/post_project/AI_Blog/posts/RAG/markdown_docs/docs_features_model-routing.md
  - file_name: docs_features_model-routing.md
  - file_type: text/markdown
  - file_size: 7024
  - creation_date: 2025-11-21
  - last_modified_date: 2025-11-21

Content:
Search

`/`

Ask AI

[API](/docs/api-reference/overview)[Models](https://openrouter.ai/models)[Chat](https://openrouter.ai/chat)[Ranking](https://openrouter.ai/rankings)

* Overview

  + [Quickstart](/docs/quickstart)
  + [FAQ](/docs/faq)
  + [Principles](/docs/overview/principles)
  + [Models](/docs/overview/models)
  + [Enterprise](https://openrouter.ai/enterprise)
* Features

  + [Privacy and Logging](/docs/features/privacy-and-logging)
  + [Zero Data Retention (ZDR)](/docs/features/zdr)
  + ...
Code

# Save retrieved chunks to a markdown file for easy inspection
# with open("retriever.md", "w", encoding="utf-8") as f:
#     f.write(f"# Query: {query}\n\n")
#     f.write(f"# Retrieved {len(nodes)} chunks from openrouter.duckdb\n\n")
#     for i, node in enumerate(nodes, 1):
#         f.write(f"{'─'*60}\n")
#         f.write(f"## Chunk {i}\n\n")
#         if hasattr(node, 'score'):
#             f.write(f"**Similarity Score:** {node.score:.4f}\n\n")
#         if hasattr(node, 'metadata') and node.metadata:
#             f.write(f"**Metadata:**\n")
#             for key, value in node.metadata.items():
#                 f.write(f"- {key}: {value}\n")
#             f.write(f"\n")
#         f.write(f"{node.text}\n\n")

Now that our knowledge base is populated, we can test the retrieval system. We can ask a specific question, like “What are model variants?”, and query the Chroma store to see which text chunks are most relevant. This confirms that our embeddings and search are working correctly.

Question: What are model variants?

RAG result:

Code
from openai import OpenAI
from langchain_core.embeddings import Embeddings
from langchain_chroma import Chroma
from typing import List
import os
from dotenv import load_dotenv

load_dotenv()
True
Code
# Custom embeddings class for OpenRouter API
class OpenRouterEmbeddings(Embeddings):
    """Custom embeddings class for OpenRouter API."""
    
    def __init__(self, api_key: str, model: str = "qwen/qwen3-embedding-8b"):
        self.client = OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=api_key,
        )
        self.model = model
    
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed a list of documents."""
        response = self.client.embeddings.create(
            extra_headers={
                "HTTP-Referer": "https://ai-blog.com",
                "X-Title": "AI Blog RAG",
            },
            model=self.model,
            input=texts,
            encoding_format="float"
        )
        return [item.embedding for item in response.data]
    
    def embed_query(self, text: str) -> List[float]:
        """Embed a single query."""
        response = self.client.embeddings.create(
            extra_headers={
                "HTTP-Referer": "https://ai-blog.com",
                "X-Title": "AI Blog RAG",
            },
            model=self.model,
            input=text,
            encoding_format="float"
        )
        return response.data[0].embedding

# Get OpenRouter API key
openrouter_api_key = os.getenv("OPENROUTER_API_KEY")
if not openrouter_api_key:
    raise ValueError("OPENROUTER_API_KEY not found in environment variables")

# Create embeddings instance using OpenRouter
embeddings = OpenRouterEmbeddings(
    api_key=openrouter_api_key,
    model="qwen/qwen3-embedding-8b"
)

# Define vector store location
persist_directory = "chroma_db_data"

# Load existing vector store
vectorstore = Chroma(
    persist_directory=persist_directory,
    embedding_function=embeddings
)

# Test query
query = "What are model variants?"

# Perform similarity search
results = vectorstore.similarity_search(query, k=5)

print(f"\nQuery: '{query}'")

Query: 'What are model variants?'
Code
print(f"Found {len(results)} relevant chunks:\n")
Found 5 relevant chunks:
Code
for i, doc in enumerate(results, 1):
    print(f"Result {i}:")
    print(f"Source: {doc.metadata.get('source', 'unknown')}")
    print(f"Content preview: {doc.page_content[:800]}...")
Result 1:
Source: docs_faq_how-are-rate-limits-calculated.md
Content preview: ###### What are model variants?

Variants are suffixes that can be added to the model slug to change its behavior.

Static variants can only be used with specific models and these are listed in our [models api](https://openrouter.ai/api/v1/models).

1. `:free` - The model is always provided for free and has low rate limits.
2. `:beta` - The model is not moderated by OpenRouter.
3. `:extended` - The model has longer than usual context length.
4. `:exacto` - The model only uses OpenRouter-curated high-quality endpoints.
5. `:thinking` - The model supports reasoning by default.

Dynamic variants can be used on all models and they change the behavior of how the request is routed or used.

1. `:online` - All requests will run a query to extract web results that are attached to the prompt.
2. `:...
Result 2:
Source: docs_faq.md
Content preview: ###### What are model variants?

Variants are suffixes that can be added to the model slug to change its behavior.

Static variants can only be used with specific models and these are listed in our [models api](https://openrouter.ai/api/v1/models).

1. `:free` - The model is always provided for free and has low rate limits.
2. `:beta` - The model is not moderated by OpenRouter.
3. `:extended` - The model has longer than usual context length.
4. `:exacto` - The model only uses OpenRouter-curated high-quality endpoints.
5. `:thinking` - The model supports reasoning by default.

Dynamic variants can be used on all models and they change the behavior of how the request is routed or used.

1. `:online` - All requests will run a query to extract web results that are attached to the prompt.
2. `:...
Result 3:
Source: docs_use-cases_crypto-api.md
Content preview: [API](/docs/api-reference/overview)[Models](https://openrouter.ai/models)[Chat](https://openrouter.ai/chat)[Ranking](https://openrouter.ai/rankings)...
Result 4:
Source: docs_sdks_typescript.md
Content preview: [API](/docs/api-reference/overview)[Models](https://openrouter.ai/models)[Chat](https://openrouter.ai/chat)[Ranking](https://openrouter.ai/rankings)...
Result 5:
Source: docs_features_provider-routing.md
Content preview: Route requests through OpenRouter-curated providers

Next](/docs/features/exacto-variant)[Built with](https://buildwithfern.com/?utm_campaign=buildWith&utm_medium=docs&utm_source=openrouter.ai)

[![Logo](https://files.buildwithfern.com/openrouter.docs.buildwithfern.com/docs/2025-11-21T16:36:36.134Z/content/assets/logo.svg)![Logo](https://files.buildwithfern.com/openrouter.docs.buildwithfern.com/docs/2025-11-21T16:36:36.134Z/content/assets/logo-white.svg)](https://openrouter.ai/)

[API](/docs/api-reference/overview)[Models](https://openrouter.ai/models)[Chat](https://openrouter.ai/chat)[Ranking](https://openrouter.ai/rankings)...
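
If you also want to see how strongly each chunk matches the query, the Chroma store can return scores alongside the documents. The snippet below is an optional sanity check on the vectorstore created above: similarity_search_with_score returns (document, score) pairs, and with Chroma's default distance metric smaller scores indicate closer matches.

Code
# Optional: inspect how close each retrieved chunk is to the query.
# similarity_search_with_score returns (Document, score) pairs; with Chroma's
# default distance metric, smaller scores mean closer matches.
scored = vectorstore.similarity_search_with_score(query, k=5)

for i, (doc, score) in enumerate(scored, 1):
    print(f"Result {i} (distance: {score:.4f})")
    print(f"Source: {doc.metadata.get('source', 'unknown')}")
    print(f"Preview: {doc.page_content[:200]}...\n")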

Chat with RAG

  • R
  • Python chatlas
  • Python LangChain

The final piece is to connect this retrieval capability to a chat interface. We use ellmer to create a chat client. Crucially, we register a “retrieval tool” using ragnar_register_tool_retrieve. This gives the LLM the ability to query our vector store whenever it needs information to answer a user’s question.

We also provide a system prompt that instructs the model to always check the knowledge base and cite its sources.

Code
library(ellmer)
library(dotenv)
load_dot_env(file = ".env")

chat <- chat_openrouter(
    api_key = Sys.getenv("OPENROUTER_API_KEY"),
    model = "openai/gpt-oss-120b",
    system_prompt = glue::trim("
  You are an assistant for question-answering tasks. You are concise.

  Before responding, retrieve relevant material from the knowledge store. Quote or
  paraphrase passages, clearly marking your own words versus the source. Provide a
  working link for every source cited, as well as any additional relevant links.
  Do not answer unless you have retrieved and cited a source. If you do not find
  relevant information, say 'I could not find anything relevant in the knowledge base.'
    ")
) |>
    ragnar_register_tool_retrieve(store, top_k = 3)

Question: What are model variants?

Code
chat$chat("What are model variants?")

Model variants are suffixes that you attach to a model’s slug in order to modify how the model is provided or how the request is processed.

  • Static variants work only with certain models (the allowed combinations are listed in the /models API). They change the model’s characteristics, for example:

    1. :free – the model is always offered at no cost and has low rate‑limits.

    2. :beta – the model is served without OpenRouter’s moderation.

    3. :extended – the model gets a longer‑than‑usual context window.

    4. :exacto – only high‑quality, OpenRouter‑curated endpoints are used.

    5. :thinking – the model has reasoning (chain‑of‑thought) enabled by default.

  • Dynamic variants can be applied to any model and affect routing or extra processing:

    1. :online – each request performs a web‑search query and appends the results to the prompt.
    2. :nitro – providers are sorted by throughput, favoring the fastest responses.
    3. :floor – providers are sorted by price, favoring the cheapest options.

In short, a variant is a suffix like :free, :online, etc., that you add to a model identifier (e.g., anthropic/claude-3.5-sonnet:free) to obtain a particular pricing, moderation, context‑length, reasoning, or routing behavior.

Source: OpenRouter FAQ – “What are model variants?” which lists the static and dynamic variants and explains that they are suffixes that change a model’s behavior.

We can also use the chatlas library to create a chat interface. Here, we define a custom tool retrieve_trusted_content that queries our DuckDB index. We then register this tool with the chat model, allowing it to pull in relevant information when answering user questions.

Question: What are model variants?

Code
import os
from typing import Any, List
from openai import OpenAI
import chatlas as ctl
from llama_index.core import VectorStoreIndex, Settings
from llama_index.core.embeddings import BaseEmbedding
from llama_index.vector_stores.duckdb import DuckDBVectorStore
from dotenv import load_dotenv
load_dotenv()
True
Code
# Ensure API key
openrouter_api_key = os.getenv("OPENROUTER_API_KEY")

# Custom OpenRouter Embedding class for LlamaIndex
class OpenRouterEmbedding(BaseEmbedding):
    """Custom embedding class for OpenRouter API compatible with LlamaIndex."""

    def __init__(
        self,
        api_key: str,
        model: str = "qwen/qwen3-embedding-8b",
        **kwargs: Any
    ):
        super().__init__(**kwargs)
        self._client = OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=api_key,
        )
        self._model = model

    def _get_query_embedding(self, query: str) -> List[float]:
        """Get embedding for a query string."""
        response = self._client.embeddings.create(
            extra_headers={
                "HTTP-Referer": "https://ai-blog.com",
                "X-Title": "AI Blog RAG",
            },
            model=self._model,
            input=query,
            encoding_format="float"
        )
        return response.data[0].embedding

    def _get_text_embedding(self, text: str) -> List[float]:
        """Get embedding for a text string."""
        return self._get_query_embedding(text)

    async def _aget_query_embedding(self, query: str) -> List[float]:
        """Async version of get_query_embedding."""
        return self._get_query_embedding(query)

    async def _aget_text_embedding(self, text: str) -> List[float]:
        """Async version of get_text_embedding."""
        return self._get_text_embedding(text)

# 1. Configure Embedding Model using custom OpenRouter class
# (Note: must use the same embedding model that was used to create the database)
embed_model = OpenRouterEmbedding(
    api_key=openrouter_api_key,
    model="qwen/qwen3-embedding-8b"
)
Settings.embed_model = embed_model

# 2. Load Vector Store
vector_store = DuckDBVectorStore(database_name="openrouter.duckdb",persist_dir="./persist/")
index = VectorStoreIndex.from_vector_store(vector_store)

# 3. Define Retrieval Tool
def retrieve_trusted_content(query: str, top_k: int = 8):
    """
    Retrieve relevant content from the knowledge store.

    Parameters
    ----------
    query
        The query used to semantically search the knowledge store.
    top_k
        The number of results to retrieve from the knowledge store.
    """
    #print(f"Retrieving content for query: '{query}'")
    retriever = index.as_retriever(similarity_top_k=top_k)
    nodes = retriever.retrieve(query)
    return [f"<excerpt>{x.text}</excerpt>" for x in nodes]

# 4. Initialize Chat with System Prompt
chat = ctl.ChatOpenRouter(
    model="openai/gpt-oss-120b",
    api_key=openrouter_api_key,
    base_url="https://openrouter.ai/api/v1",
    system_prompt=(
        "You are an assistant for question-answering tasks. "
        "Use the retrieve_trusted_content tool to find relevant information from the knowledge store. "
        "Answer questions based on the retrieved content. "
        "If you cannot find relevant information, say so clearly."
    )
)

chat.register_tool(retrieve_trusted_content)
Code
#retrieve_trusted_content("What are model variants?",top_k=3)
Code
response=chat.chat("What are model variants?", echo="none")
Code
print(response)
**Model variants are special suffixes you can tack onto a model’s slug to change how the model is served or how the request is routed.**  
Instead of treating a model name like a single, immutable identifier, OpenRouter lets you add a “variant” after a colon (`:`) – e.g. `anthropic/claude-3.5-sonnet:free` – that tells the platform to apply a particular preset of behavior.

---

### Two families of variants

| Variant type | How it works | Examples |
|--------------|--------------|----------|
| **Static variants** | Hard‑coded to specific models. Only the models that support the variant will accept it. | `:free` – always free (low‑rate‑limit) <br> `:extended` – longer context window than the default <br> `:exacto` – routes only to OpenRouter‑curated high‑quality providers <br> `:thinking` – enables built‑in reasoning mode |
| **Dynamic variants** | Apply to *any* model and affect the routing or request processing rather than the model itself. | `:online` – fetches web results and appends them to the prompt <br> `:nitro` – sorts providers by throughput (fastest first) <br> `:floor` – sorts providers by price (cheapest first) |

*Source: OpenRouter FAQ “What are model variants?”*

---

### What each static variant does

| Variant | Effect |
|---------|--------|
| **`:free`** | The model is offered at no cost and comes with a low rate‑limit. Good for quick experiments when price is the primary concern. |
| **`:extended`** | Gives the model a longer context length than the standard version, useful for very long prompts or multi‑turn conversations. |
| **`:exacto`** | Forces the request to go only through a vetted subset of providers that have shown higher tool‑calling accuracy and reliability (the “Exacto” routing). |
| **`:thinking`** | Turns on the model’s built‑in reasoning capability, so it automatically performs chain‑of‑thought style reasoning without extra flags. |

*Source: OpenRouter FAQ “Static variants …”*

---

### What each dynamic variant does

| Variant | Effect |
|---------|--------|
| **`:online`** | Executes a web‑search, attaches the retrieved results to the prompt, and then runs the model – ideal for up‑to‑date knowledge retrieval. |
| **`:nitro`** | Provider selection is sorted by throughput (speed) rather than the default sort, giving you the fastest possible response. |
| **`:floor`** | Provider selection is sorted by price, so the cheapest eligible endpoint is chosen first. |

*Source: OpenRouter FAQ “Dynamic variants …”*

---

### How to use a variant

Just add the suffix to the model identifier in any API request (chat, completions, etc.):

```json
{
  "model": "anthropic/claude-3.5-sonnet:extended",
  "messages": [{ "role": "user", "content": "Explain quantum computing in 500 words." }]
}
```

If you use a dynamic variant, the same slug works with any model:

```json
{
  "model": "openai/gpt-4:nitro",
  "messages": [{ "role": "user", "content": "Generate a fast response to this query." }]
}
```

---

### Why variants matter

- **Fine‑grained control** – Choose free usage, longer context, or higher reliability without changing code.  
- **Routing optimization** – Direct traffic to the fastest or cheapest providers (`:nitro`, `:floor`).  
- **Feature toggles** – Enable tool‑calling accuracy (`:exacto`) or web‑search (`:online`) on the fly.

In short, model variants are a convenient, URL‑style way to modify a model’s pricing, capabilities, or routing behavior by appending a suffix to the model slug.
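
Because chatlas keeps the conversation state and the registered tool, follow-up questions can be sent to the same chat object and the model will call retrieve_trusted_content again as needed. The query below is only an illustration; any question covered by the scraped pages would work.

Code
# Ask a follow-up question in the same conversation (illustrative query).
followup = chat.chat("How are rate limits calculated?", echo="none")
print(followup)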

The final piece is to connect this retrieval capability to a chat interface. We use LangChain to create a RAG chain. We create a retriever from our vector store and combine it with a ChatOpenAI model (using OpenRouter) and a prompt template. This gives the LLM the ability to query our vector store whenever it needs information to answer a user’s question.

Question: What are model variants?

Code
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from openai import OpenAI
from langchain_core.embeddings import Embeddings
from langchain_chroma import Chroma
from typing import List
import os
from dotenv import load_dotenv

load_dotenv()
True
Code
# Custom embeddings class for OpenRouter API
class OpenRouterEmbeddings(Embeddings):
    """Custom embeddings class for OpenRouter API."""
    
    def __init__(self, api_key: str, model: str = "qwen/qwen3-embedding-8b"):
        self.client = OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=api_key,
        )
        self.model = model
    
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed a list of documents."""
        response = self.client.embeddings.create(
            extra_headers={
                "HTTP-Referer": "https://ai-blog.com",
                "X-Title": "AI Blog RAG",
            },
            model=self.model,
            input=texts,
            encoding_format="float"
        )
        return [item.embedding for item in response.data]
    
    def embed_query(self, text: str) -> List[float]:
        """Embed a single query."""
        response = self.client.embeddings.create(
            extra_headers={
                "HTTP-Referer": "https://ai-blog.com",
                "X-Title": "AI Blog RAG",
            },
            model=self.model,
            input=text,
            encoding_format="float"
        )
        return response.data[0].embedding

# Get OpenRouter API key
openrouter_api_key = os.getenv("OPENROUTER_API_KEY")
if not openrouter_api_key:
    raise ValueError("OPENROUTER_API_KEY not found in environment variables")

# Create embeddings instance using OpenRouter
embeddings = OpenRouterEmbeddings(
    api_key=openrouter_api_key,
    model="qwen/qwen3-embedding-8b"
)

# Define vector store location
persist_directory = "chroma_db_data"

# Load existing vector store
print(f"Loading existing vectorstore from {persist_directory}...")
Loading existing vectorstore from chroma_db_data...
Code
vectorstore = Chroma(
    persist_directory=persist_directory,
    embedding_function=embeddings
)
print(f"✓ Loaded vectorstore")
✓ Loaded vectorstore
Code
# Initialize LLM using OpenRouter
llm = ChatOpenAI(
    model="openai/gpt-oss-120b",
    openai_api_key=os.getenv("OPENROUTER_API_KEY"),
    openai_api_base="https://openrouter.ai/api/v1"
)

# Create prompt template
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "Context: {context}"
    "\n\n"
    "Question: {question}"
)

prompt = ChatPromptTemplate.from_template(system_prompt)

# Create retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Helper function to format documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Create RAG chain using LCEL
rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)

print("✓ RAG chain created successfully!")
✓ RAG chain created successfully!
Code
# Test the RAG chain
question = "What are model variants?"

print(f"\nQuestion: {question}")

Question: What are model variants?
Code
# Get context documents separately for display
context_docs = retriever.invoke(question)

# Invoke the RAG chain
answer = rag_chain.invoke(question)
import textwrap

for line in answer.split('\n'):
    print(textwrap.fill(line, width=80))
Model variants are suffixes you append to a model’s slug to modify its
behavior. Static variants
(e.g., `:free`, `:beta`, `:extended`, `:exacto`, `:thinking`) apply only to
certain models, while dynamic variants (e.g., `:online`, `:nitro`, `:floor`)
work with any model to affect routing or features. These variants let you adjust
cost, speed, context length, moderation, reasoning, or web‑search capabilities.
Source Code
---
title: "Retrieval-Augmented Generation(RAG) in R & Python"
author: "Tony D"
date: "2025-11-02"
categories: [AI, API, tutorial]
image: "images.png"

format:
  html:
    code-fold: true
    code-tools: true
    code-copy: true

execute:
  warning: false
---


```{r setup, include=FALSE}
library(reticulate)

# Use Python 3.13
use_python("/Library/Frameworks/Python.framework/Versions/3.13/bin/python3", required = TRUE)

# Install required packages
py_install(c("openai", "langchain-core", "langchain-chroma", "langchain-community", "langchain-openai", "markitdown", "beautifulsoup4", "python-dotenv", "llama-index", "llama-index-vector-stores-duckdb"), pip = TRUE)

# Verify Python
py_config()
```



# Introduction

Retrieval-Augmented Generation (RAG) is a powerful technique that combines the generative capabilities of Large Language Models (LLMs) with the precision of information retrieval. By grounding LLM responses in external, verifiable data, RAG reduces hallucinations and enables the model to answer questions about specific, private, or up-to-date information.

In this tutorial, we will build a RAG system using both R and Python. 

In R, we'll leverage the `ragnar` package for RAG workflows and `ellmer` for chat interfaces. 

In Python, we'll use `LangChain` for the RAG pipeline, `ChromaDB` for the vector store, and `OpenAI` for model interaction.

Our goal is to create a system that can answer questions about the OpenRouter API by scraping its documentation.

# Data Collection

::: {.panel-tabset}


## R

First, we need to gather the data for our knowledge base. We'll use the `rvest` package to scrape URLs from the OpenRouter documentation. This will give us a list of pages to ingest.


```{r}
library(ragnar)
library(ellmer)
library(dotenv)
load_dot_env(file = ".env")
```

```{r}
library(rvest)

# URL to scrape
url <- "https://openrouter.ai/docs/quickstart"

# Read the HTML content of the page
page <- read_html(url)

# Extract all <a> tags with href
links <- page %>%
    html_nodes("a") %>%
    html_attr("href")

# Remove NAs and duplicates
links <- unique(na.omit(links))

# Optional: keep only full URLs
links_full <- paste0("https://openrouter.ai", links[grepl("^/docs/", links)])

# Print all links
print(links_full)
```


## Python 

First, we need to gather the data for our knowledge base. We'll use `requests` and `BeautifulSoup` to scrape URLs from the OpenRouter documentation. This will give us a list of pages to ingest.

```{python}
import sys
print(sys.executable)
```

```{python}
import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
import os
from markitdown import MarkItDown
from io import BytesIO
import re

# Load environment variables
load_dotenv()

# Helper functions
def fetch_html(url: str) -> bytes:
    """Fetch HTML content from URL and return as bytes."""
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.content

def html_to_markdown(html_bytes: bytes) -> str:
    """Convert HTML bytes to markdown using MarkItDown."""
    md = MarkItDown()
    stream = BytesIO(html_bytes)
    result = md.convert_stream(stream, mime_type="text/html")
    return result.markdown

def save_markdown(md_content: str, output_path: str):
    """Save markdown content to file."""
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(md_content)

def sanitize_filename(filename: str) -> str:
    """Sanitize URL to create a valid filename."""
    filename = re.sub(r'^https?://[^/]+', '', filename)
    filename = re.sub(r'[^\w\-_.]', '_', filename)
    filename = filename.strip('_')
    if not filename.endswith('.md'):
        filename += '.md'
    return filename

# URL to scrape
url = "https://openrouter.ai/docs/quickstart"

# Read the HTML content of the page
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Extract all <a> tags with href
links = [a['href'] for a in soup.find_all('a', href=True)]

# Remove duplicates
links = list(set(links))

# Keep only full URLs for docs
links_full = [f"https://openrouter.ai{link}" for link in links if link.startswith("/docs/")]

# Explicitly add FAQ
links_full.append("https://openrouter.ai/docs/faq")
links_full = list(set(links_full))

# Print all links
print(f"Found {len(links_full)} documentation URLs")
print(links_full)
```


:::

# Save web to local data

::: {.panel-tabset}

## R

To perform semantic search, we need to store our text data as vectors (embeddings). We'll use `DuckDB` as our local vector store. We also need an embedding model to convert text into vectors. Here, we configure `ragnar` to use a specific embedding model via an OpenAI-compatible API (SiliconFlow).


```{r}
#| eval: false
# pages <- ragnar_find_links(base_url)
pages <- links_full
store_location <- "openrouter.duckdb"

store <- ragnar_store_create(
    store_location,
    overwrite = TRUE,
    embed = \(x) ragnar::embed_openai(x,
        model = "BAAI/bge-m3",
        base_url = "https://api.siliconflow.cn/v1",
        api_key = Sys.getenv("siliconflow")
    )
)
```


With our store initialized, we can now ingest the data. We iterate through the list of pages we scraped earlier. For each page, we:
1.  Read the content as markdown.
2.  Split the content into smaller chunks (approx. 2000 characters).
3.  Insert these chunks into our vector store.

This process builds the index that we'll search against.


```{r}
# page="https://openrouter.ai/docs/faq"
# chunks <- page |>read_as_markdown() |>markdown_chunk(target_size = 2000)
# ragnar_chunks_view(chunks)
```


```{r}
#| eval: false
for (page in pages) {
    message("ingesting: ", page)
    print(page)
    chunks <- page |>
        read_as_markdown() |>
        markdown_chunk(target_size = 2000)
    # print(chunks)
    # print('chunks done')
    ragnar_store_insert(store, chunks)
    print("insrt done")
}
```



```{r}
#| eval: false
ragnar_store_build_index(store)
```



## Python DuckDB
```{python}
#| eval: false
import os
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext, Settings
from llama_index.vector_stores.duckdb import DuckDBVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding

# --- 1. Configuration ---

# Ensure your API key is available
openrouter_api_key = os.getenv("OPENROUTER_API_KEY") # or paste string directly

# Initialize the embedding model pointing to OpenRouter
# We use the OpenAI class because OpenRouter uses an OpenAI-compatible API structure
embed_model = OpenAIEmbedding(
    api_key=openrouter_api_key,
    base_url="https://openrouter.ai/api/v1",
    model="qwen/qwen3-embedding-8b"  
)

# Update the global settings so LlamaIndex knows to use this model
Settings.embed_model = embed_model
Settings.chunk_size = 2000
Settings.chunk_overlap = 200
# --- 2. Ingestion and Indexing ---

# Load data
documents = SimpleDirectoryReader("markdown_docs").load_data()

# Initialize DuckDB Vector Store
vector_store = DuckDBVectorStore("openrouter.duckdb", persist_dir="./persist/")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create the index
# This will now automatically use the Qwen embeddings defined in Settings
index = VectorStoreIndex.from_documents(
    documents, 
    storage_context=storage_context
)

```





## Python Chroma

To perform semantic search, we need to store our text data as vectors (embeddings). We'll use `ChromaDB` as our local vector store. We also need an embedding model to convert text into vectors. Here, we configure a custom `OpenRouterEmbeddings` class to use the `qwen/qwen3-embedding-8b` model via the OpenRouter API.


```{python}
#| eval: false
from openai import OpenAI
from langchain_core.embeddings import Embeddings
from langchain_chroma import Chroma
from typing import List
import os
from dotenv import load_dotenv

load_dotenv()

# Custom embeddings class for OpenRouter API
class OpenRouterEmbeddings(Embeddings):
    """Custom embeddings class for OpenRouter API."""
    
    def __init__(self, api_key: str, model: str = "text-embedding-3-small"):
        self.client = OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=api_key,
        )
        self.model = model
    
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed a list of documents."""
        response = self.client.embeddings.create(
            extra_headers={
                "HTTP-Referer": "https://ai-blog.com",
                "X-Title": "AI Blog RAG",
            },
            model=self.model,
            input=texts,
            encoding_format="float"
        )
        return [item.embedding for item in response.data]
    
    def embed_query(self, text: str) -> List[float]:
        """Embed a single query."""
        response = self.client.embeddings.create(
            extra_headers={
                "HTTP-Referer": "https://ai-blog.com",
                "X-Title": "AI Blog RAG",
            },
            model=self.model,
            input=text,
            encoding_format="float"
        )
        return response.data[0].embedding

# Get OpenRouter API key
openrouter_api_key = os.getenv("OPENROUTER_API_KEY")
if not openrouter_api_key:
    raise ValueError("OPENROUTER_API_KEY not found in environment variables")

# Create embeddings instance using OpenRouter
embeddings = OpenRouterEmbeddings(
    api_key=openrouter_api_key,
    model="qwen/qwen3-embedding-8b"
)

# Define vector store location
persist_directory = "chroma_db_data"
```




With our store initialized, we can now ingest the data. We iterate through the markdown files we saved earlier. For each file, we:
1.  Load the content.
2.  Split the content into smaller chunks (approx. 2000 characters) using `RecursiveCharacterTextSplitter`.
3.  Create a new `Chroma` vector store from these chunks.

This process builds the index that we'll search against.



```{python}
#| eval: false
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
import shutil

# Helper function to load markdown files
def load_markdown_files(directory: str) -> list[Document]:
    """Load all markdown files from directory and create Document objects."""
    documents = []
    if not os.path.exists(directory):
        return documents
        
    for filename in os.listdir(directory):
        if filename.endswith('.md'):
            filepath = os.path.join(directory, filename)
            with open(filepath, 'r', encoding='utf-8') as f:
                content = f.read()
                doc = Document(
                    page_content=content,
                    metadata={
                        "source": filename,
                        "filepath": filepath
                    }
                )
                documents.append(doc)
    return documents

# Create output directory for markdown files
output_dir = "markdown_docs"
os.makedirs(output_dir, exist_ok=True)

# Convert each URL to markdown and save
for i, link_url in enumerate(links_full, 1):
    try:
        print(f"Processing {i}/{len(links_full)}: {link_url}")
        html_content = fetch_html(link_url)
        markdown_content = html_to_markdown(html_content)
        filename = sanitize_filename(link_url)
        output_path = os.path.join(output_dir, filename)
        save_markdown(markdown_content, output_path)
        print(f"  ✓ Saved to {output_path}")
    except Exception as e:
        print(f"  ✗ Error processing {link_url}: {str(e)}")

# Load markdown documents
documents = load_markdown_files(output_dir)
print(f"\nLoaded {len(documents)} markdown documents")

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

splits = text_splitter.split_documents(documents)
print(f"Split into {len(splits)} chunks")
```


```{python}
#| eval: false
# Remove existing database if it exists
if os.path.exists(persist_directory):
    print(f"Removing existing database at {persist_directory}...")
    shutil.rmtree(persist_directory)

# Create new vector store
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory=persist_directory
)

print(f"\n✓ Successfully created ChromaDB with {len(splits)} chunks!")
print(f"✓ Database saved to: {persist_directory}")
```



:::










# Retrieval

::: {.panel-tabset}

## R

Now that our knowledge base is populated, we can test the retrieval system. We can ask a specific question, like "What are model variants?", and query the store to see which text chunks are most relevant. This confirms that our embeddings and search are working correctly.

### Question: What are model variants?

RAG result:

```{r}
store_location <- "openrouter.duckdb"
store <- ragnar_store_connect(store_location)

text <- "What are model variants?"

#' # Retrieving Chunks
#' Once the store is set up, retrieve the most relevant text chunks like this

relevant_chunks <- ragnar_retrieve(store, text, top_k = 3)
cat("Retrieved", nrow(relevant_chunks), "chunks:\n\n")
for (i in seq_len(nrow(relevant_chunks))) {
    cat(sprintf("--- Chunk %d ---\n%s\n\n", i, relevant_chunks$text[i]))
}
```


```{r}
# ragnar_store_inspect(store)
#ragnar_chunks_view(chunks)
```

## Python DuckDB

In Python, we can use `LlamaIndex` to interact with our DuckDB vector store. In this step, we configure the embedding model and retrieve the top relevant chunks for our query, printing them for inspection. We won't use an LLM for generation yet; the focus here is solely on verifying retrieval quality.

### Question: What are model variants?

RAG result:

```{python}
import os
import sys
print(f"Python executable: {sys.executable}")
print(f"Python path: {sys.path}")
from typing import Any, List
from openai import OpenAI
from llama_index.core import VectorStoreIndex, Settings
from llama_index.core.embeddings import BaseEmbedding
from llama_index.vector_stores.duckdb import DuckDBVectorStore
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Ensure your API key is available
openrouter_api_key = os.getenv("OPENROUTER_API_KEY")

# Custom OpenRouter Embedding class for LlamaIndex
class OpenRouterEmbedding(BaseEmbedding):
    """Custom embedding class for OpenRouter API compatible with LlamaIndex."""
    
    def __init__(
        self,
        api_key: str,
        model: str = "qwen/qwen3-embedding-8b",
        **kwargs: Any
    ):
        super().__init__(**kwargs)
        self._client = OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=api_key,
        )
        self._model = model
    
    def _get_query_embedding(self, query: str) -> List[float]:
        """Get embedding for a query string."""
        response = self._client.embeddings.create(
            extra_headers={
                "HTTP-Referer": "https://ai-blog.com",
                "X-Title": "AI Blog RAG",
            },
            model=self._model,
            input=query,
            encoding_format="float"
        )
        return response.data[0].embedding
    
    def _get_text_embedding(self, text: str) -> List[float]:
        """Get embedding for a text string."""
        return self._get_query_embedding(text)
    
    async def _aget_query_embedding(self, query: str) -> List[float]:
        """Async version of get_query_embedding."""
        return self._get_query_embedding(query)
    
    async def _aget_text_embedding(self, text: str) -> List[float]:
        """Async version of get_text_embedding."""
        return self._get_text_embedding(text)

# 1. Configure Embedding Model using custom OpenRouter class
embed_model = OpenRouterEmbedding(
    api_key=openrouter_api_key,
    model="qwen/qwen3-embedding-8b"
)

# 2. Apply Settings
Settings.embed_model = embed_model

# 3. Load and Retrieve
# Load the existing DuckDB vector store
print("Loading vector store from openrouter.duckdb...")
vector_store = DuckDBVectorStore(database_name="openrouter.duckdb", persist_dir="./persist/")

index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

# Define query
query = "What are model variants?"
print(f"\n{'='*60}")
print(f"Query: '{query}'")
print(f"{'='*60}\n")

# Retrieve the top 5 chunks
retriever = index.as_retriever(similarity_top_k=5)
nodes = retriever.retrieve(query)

# Print detailed retrieval info
print(f"Retrieved {len(nodes)} chunks from DuckDB:\n")
for i, node in enumerate(nodes, 1):
    print(f"{'─'*60}")
    print(f"Chunk {i}")
    print(f"{'─'*60}")

    # Print similarity score
    if hasattr(node, 'score'):
        print(f"Similarity Score: {node.score:.4f}")

    # Print metadata
    if hasattr(node, 'metadata') and node.metadata:
        print(f"Metadata:")
        for key, value in node.metadata.items():
            print(f"  - {key}: {value}")

    # Print text content (truncated for display)
    text_preview = node.text[:500] + "..." if len(node.text) > 500 else node.text
    print(f"\nContent:\n{text_preview}\n")

# Save retrieved chunks to a markdown file for easy inspection
# with open("retriever.md", "w", encoding="utf-8") as f:
#     f.write(f"# Query: {query}\n\n")
#     f.write(f"# Retrieved {len(nodes)} chunks from openrouter.duckdb\n\n")
#     for i, node in enumerate(nodes, 1):
#         f.write(f"{'─'*60}\n")
#         f.write(f"## Chunk {i}\n\n")
#         if hasattr(node, 'score'):
#             f.write(f"**Similarity Score:** {node.score:.4f}\n\n")
#         if hasattr(node, 'metadata') and node.metadata:
#             f.write(f"**Metadata:**\n")
#             for key, value in node.metadata.items():
#                 f.write(f"- {key}: {value}\n")
#             f.write(f"\n")
#         f.write(f"{node.text}\n\n")
```




## Python Chroma

Now that our knowledge base is populated, we can test the retrieval system. We can ask a specific question, like "What are model variants?", and query the `Chroma` store to see which text chunks are most relevant. This confirms that our embeddings and search are working correctly.

### Question: What are model variants?

RAG result:

```{python}
from openai import OpenAI
from langchain_core.embeddings import Embeddings
from langchain_chroma import Chroma
from typing import List
import os
from dotenv import load_dotenv

load_dotenv()

# Custom embeddings class for OpenRouter API
class OpenRouterEmbeddings(Embeddings):
    """Custom embeddings class for OpenRouter API."""
    
    def __init__(self, api_key: str, model: str = "qwen/qwen3-embedding-8b"):
        self.client = OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=api_key,
        )
        self.model = model
    
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed a list of documents."""
        response = self.client.embeddings.create(
            extra_headers={
                "HTTP-Referer": "https://ai-blog.com",
                "X-Title": "AI Blog RAG",
            },
            model=self.model,
            input=texts,
            encoding_format="float"
        )
        return [item.embedding for item in response.data]
    
    def embed_query(self, text: str) -> List[float]:
        """Embed a single query."""
        response = self.client.embeddings.create(
            extra_headers={
                "HTTP-Referer": "https://ai-blog.com",
                "X-Title": "AI Blog RAG",
            },
            model=self.model,
            input=text,
            encoding_format="float"
        )
        return response.data[0].embedding

# Get OpenRouter API key
openrouter_api_key = os.getenv("OPENROUTER_API_KEY")
if not openrouter_api_key:
    raise ValueError("OPENROUTER_API_KEY not found in environment variables")

# Create embeddings instance using OpenRouter
embeddings = OpenRouterEmbeddings(
    api_key=openrouter_api_key,
    model="qwen/qwen3-embedding-8b"
)

# Define vector store location
persist_directory = "chroma_db_data"

# Load existing vector store
vectorstore = Chroma(
    persist_directory=persist_directory,
    embedding_function=embeddings
)

# Test query
query = "What are model variants?"

# Perform similarity search
results = vectorstore.similarity_search(query, k=5)

print(f"\nQuery: '{query}'")
print(f"Found {len(results)} relevant chunks:\n")

for i, doc in enumerate(results, 1):
    print(f"Result {i}:")
    print(f"Source: {doc.metadata.get('source', 'unknown')}")
    print(f"Content preview: {doc.page_content[:800]}...")
    
```

:::



# Chat with RAG

::: {.panel-tabset}

## R

The final piece is to connect this retrieval capability to a chat interface. We use `ellmer` to create a chat client. Crucially, we register a "retrieval tool" using `ragnar_register_tool_retrieve`. This gives the LLM the ability to query our vector store whenever it needs information to answer a user's question.

We also provide a system prompt that instructs the model to always check the knowledge base and cite its sources.


```{r}
library(ellmer)
library(dotenv)
load_dot_env(file = ".env")

chat <- chat_openrouter(
    api_key = Sys.getenv("OPENROUTER_API_KEY"),
    model = "openai/gpt-oss-120b",
    system_prompt = glue::trim("
  You are an assistant for question-answering tasks. You are concise.

  Before responding, retrieve relevant material from the knowledge store. Quote or
  paraphrase passages, clearly marking your own words versus the source. Provide a
  working link for every source cited, as well as any additional relevant links.
  Do not answer unless you have retrieved and cited a source. If you do not find
  relevant information, say 'I could not find anything relevant in the knowledge base.'
    ")
) |>
    ragnar_register_tool_retrieve(store, top_k = 3)


```

### Question: What are model variants?

```{r}
#| results: asis
chat$chat("What are model variants?")
```


## Python chatlas

We can also use the `chatlas` library to create a chat interface. Here, we define a custom tool `retrieve_trusted_content` that queries our DuckDB index. We then register this tool with the chat model, allowing it to pull in relevant information when answering user questions.

### Question: What are model variants?


```{python}
import os
from typing import Any, List
from openai import OpenAI
import chatlas as ctl
from llama_index.core import VectorStoreIndex, Settings
from llama_index.core.embeddings import BaseEmbedding
from llama_index.vector_stores.duckdb import DuckDBVectorStore
from dotenv import load_dotenv
load_dotenv()
# Ensure API key
openrouter_api_key = os.getenv("OPENROUTER_API_KEY")

# Custom OpenRouter Embedding class for LlamaIndex
class OpenRouterEmbedding(BaseEmbedding):
    """Custom embedding class for OpenRouter API compatible with LlamaIndex."""

    def __init__(
        self,
        api_key: str,
        model: str = "qwen/qwen3-embedding-8b",
        **kwargs: Any
    ):
        super().__init__(**kwargs)
        self._client = OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=api_key,
        )
        self._model = model

    def _get_query_embedding(self, query: str) -> List[float]:
        """Get embedding for a query string."""
        response = self._client.embeddings.create(
            extra_headers={
                "HTTP-Referer": "https://ai-blog.com",
                "X-Title": "AI Blog RAG",
            },
            model=self._model,
            input=query,
            encoding_format="float"
        )
        return response.data[0].embedding

    def _get_text_embedding(self, text: str) -> List[float]:
        """Get embedding for a text string."""
        return self._get_query_embedding(text)

    async def _aget_query_embedding(self, query: str) -> List[float]:
        """Async version of get_query_embedding."""
        return self._get_query_embedding(query)

    async def _aget_text_embedding(self, text: str) -> List[float]:
        """Async version of get_text_embedding."""
        return self._get_text_embedding(text)

# 1. Configure Embedding Model using custom OpenRouter class
# (Note: must use the same embedding model that was used to create the database)
embed_model = OpenRouterEmbedding(
    api_key=openrouter_api_key,
    model="qwen/qwen3-embedding-8b"
)
Settings.embed_model = embed_model

# 2. Load Vector Store
vector_store = DuckDBVectorStore(database_name="openrouter.duckdb", persist_dir="./persist/")
index = VectorStoreIndex.from_vector_store(vector_store)

# 3. Define Retrieval Tool
def retrieve_trusted_content(query: str, top_k: int = 8):
    """
    Retrieve relevant content from the knowledge store.

    Parameters
    ----------
    query
        The query used to semantically search the knowledge store.
    top_k
        The number of results to retrieve from the knowledge store.
    """
    #print(f"Retrieving content for query: '{query}'")
    retriever = index.as_retriever(similarity_top_k=top_k)
    nodes = retriever.retrieve(query)
    return [f"<excerpt>{x.text}</excerpt>" for x in nodes]

# 4. Initialize Chat with System Prompt
chat = ctl.ChatOpenRouter(
    model="openai/gpt-oss-120b",
    api_key=openrouter_api_key,
    base_url="https://openrouter.ai/api/v1",
    system_prompt=(
        "You are an assistant for question-answering tasks. "
        "Use the retrieve_trusted_content tool to find relevant information from the knowledge store. "
        "Answer questions based on the retrieved content. "
        "If you cannot find relevant information, say so clearly."
    )
)

chat.register_tool(retrieve_trusted_content)
```

```{python}
#retrieve_trusted_content("What are model variants?",top_k=3)
```


```{python}
#| output: false
response=chat.chat("What are model variants?", echo="none")
```

```{python}
print(response)
```



## Python LangChain

The final piece is to connect this retrieval capability to a chat interface. We use `LangChain` to create a RAG chain. We create a `retriever` from our vector store and combine it with a `ChatOpenAI` model (using OpenRouter) and a prompt template. This gives the LLM the ability to query our vector store whenever it needs information to answer a user's question.


### Question: What are model variants?


```{python}
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from openai import OpenAI
from langchain_core.embeddings import Embeddings
from langchain_chroma import Chroma
from typing import List
import os
from dotenv import load_dotenv

load_dotenv()

# Custom embeddings class for OpenRouter API
class OpenRouterEmbeddings(Embeddings):
    """Custom embeddings class for OpenRouter API."""
    
    def __init__(self, api_key: str, model: str = "qwen/qwen3-embedding-8b"):
        self.client = OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=api_key,
        )
        self.model = model
    
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed a list of documents."""
        response = self.client.embeddings.create(
            extra_headers={
                "HTTP-Referer": "https://ai-blog.com",
                "X-Title": "AI Blog RAG",
            },
            model=self.model,
            input=texts,
            encoding_format="float"
        )
        return [item.embedding for item in response.data]
    
    def embed_query(self, text: str) -> List[float]:
        """Embed a single query."""
        response = self.client.embeddings.create(
            extra_headers={
                "HTTP-Referer": "https://ai-blog.com",
                "X-Title": "AI Blog RAG",
            },
            model=self.model,
            input=text,
            encoding_format="float"
        )
        return response.data[0].embedding

# Get OpenRouter API key
openrouter_api_key = os.getenv("OPENROUTER_API_KEY")
if not openrouter_api_key:
    raise ValueError("OPENROUTER_API_KEY not found in environment variables")

# Create embeddings instance using OpenRouter
embeddings = OpenRouterEmbeddings(
    api_key=openrouter_api_key,
    model="qwen/qwen3-embedding-8b"
)

# Define vector store location
persist_directory = "chroma_db_data"

# Load existing vector store
print(f"Loading existing vectorstore from {persist_directory}...")
vectorstore = Chroma(
    persist_directory=persist_directory,
    embedding_function=embeddings
)
print(f"✓ Loaded vectorstore")

# Initialize LLM using OpenRouter
llm = ChatOpenAI(
    model="openai/gpt-oss-120b",
    openai_api_key=os.getenv("OPENROUTER_API_KEY"),
    openai_api_base="https://openrouter.ai/api/v1"
)

# Create prompt template
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "Context: {context}"
    "\n\n"
    "Question: {question}"
)

prompt = ChatPromptTemplate.from_template(system_prompt)

# Create retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Helper function to format documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Create RAG chain using LCEL
rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)

print("✓ RAG chain created successfully!")

# Test the RAG chain
question = "What are model variants?"

print(f"\nQuestion: {question}")

# Get context documents separately for display
context_docs = retriever.invoke(question)

# Invoke the RAG chain
answer = rag_chain.invoke(question)
import textwrap

for line in answer.split('\n'):
    print(textwrap.fill(line, width=80))


```
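
The chain above only returns the generated answer; `context_docs` is retrieved but never displayed. If you also want to show which documents grounded the answer, a minimal sketch reusing the `context_docs` fetched above might look like this:

```{python}
#| eval: false
# Show the sources behind the answer (assumes context_docs from the chunk above)
for i, doc in enumerate(context_docs, 1):
    source = doc.metadata.get("source", "unknown")
    preview = doc.page_content[:200].replace("\n", " ")
    print(f"[{i}] {source}: {preview}...")
```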

:::
 
 

This blog is built with ❤️ and Quarto.