This tutorial demonstrates how to:
- Embed exercise science text with nomic-embed-text (v1.5) via Ollama.
- Store embeddings in Chroma, a local vector database.
- Query the embeddings and use deepseek-r1:1.5b to generate answers, stripping any <think> sections, all offline.
The dataset:
- "Strength training involves resistance exercises designed to increase muscle mass, improve muscular endurance, and enhance overall physical performance. Common methods include weightlifting with barbells or dumbbells, bodyweight exercises like push-ups, and resistance band workouts."
- "Cardiovascular exercise, such as running, cycling, or swimming, boosts heart health by improving circulation, increasing lung capacity, and reducing the risk of chronic diseases like hypertension and diabetes."
- "Stretching and mobility exercises, including yoga and dynamic warm-ups, enhance joint range of motion, reduce injury risk, and improve posture by counteracting the stiffness caused by sedentary lifestyles."
- "Post-exercise recovery is critical for performance gains. Techniques like foam rolling, adequate sleep, and proper nutrition—especially protein intake—help repair muscle fibers and reduce soreness."
- "High-Intensity Interval Training (HIIT) alternates short bursts of intense exercise with rest periods, maximizing calorie burn and improving aerobic capacity in less time than traditional steady-state cardio."
- System: Python 3.8+ with pip.
- Ollama: Installed (from ollama.com, version 0.1.26+ recommended).
- Hardware: Decent CPU (GPU optional).
- Dependencies: pip install ollama chromadb
- Start Ollama: ollama serve (runs at http://localhost:11434).
- Pull Models:
  - Embeddings: ollama pull nomic-embed-text
  - Language model: ollama pull deepseek-r1:1.5b
  Note: If deepseek-r1:1.5b isn't in Ollama's registry, confirm its exact name or load it via a custom Modelfile.
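Before running the scripts, it can help to confirm that both models are actually pulled. The following is a minimal sketch, not an official check: it assumes Ollama's REST endpoint GET /api/tags returns a JSON object with a "models" list whose entries carry a "name" field such as nomic-embed-text:latest. The helper names missing_models and fetch_tags are hypothetical and use only the standard library.

```python
import json
import urllib.request

REQUIRED = ["nomic-embed-text", "deepseek-r1:1.5b"]


def missing_models(tags_payload, required=REQUIRED):
    """Return the required models that do not appear in an /api/tags payload."""
    installed = {m.get("name", "") for m in tags_payload.get("models", [])}
    missing = []
    for name in required:
        # Assumption: a name without an explicit tag is listed as ":latest".
        wanted = name if ":" in name else f"{name}:latest"
        if wanted not in installed:
            missing.append(name)
    return missing


def fetch_tags(base_url="http://localhost:11434"):
    """Fetch the list of local models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Usage (requires `ollama serve` to be running):
#   print(missing_models(fetch_tags()) or "All required models are available.")
```

Keeping the comparison in a pure function makes it easy to test without a running server; only fetch_tags touches the network.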
Create generate_embeddings.py:
import ollama
import chromadb
# Exercise science text corpus
documents = [
"Strength training involves resistance exercises designed to increase muscle mass, improve muscular endurance, and enhance overall physical performance. Common methods include weightlifting with barbells or dumbbells, bodyweight exercises like push-ups, and resistance band workouts.",
"Cardiovascular exercise, such as running, cycling, or swimming, boosts heart health by improving circulation, increasing lung capacity, and reducing the risk of chronic diseases like hypertension and diabetes.",
"Stretching and mobility exercises, including yoga and dynamic warm-ups, enhance joint range of motion, reduce injury risk, and improve posture by counteracting the stiffness caused by sedentary lifestyles.",
"Post-exercise recovery is critical for performance gains. Techniques like foam rolling, adequate sleep, and proper nutrition—especially protein intake—help repair muscle fibers and reduce soreness.",
"High-Intensity Interval Training (HIIT) alternates short bursts of intense exercise with rest periods, maximizing calorie burn and improving aerobic capacity in less time than traditional steady-state cardio."
]
# Initialize Chroma client
client = chromadb.PersistentClient(path="./exercise_db")
# Create or get a collection
collection = client.get_or_create_collection(name="exercise_science")
# Generate and store embeddings
for i, doc in enumerate(documents):
    # nomic-embed-text expects a task prefix; only the raw text is stored
    prefixed_doc = f"search_document: {doc}"
    response = ollama.embed(model="nomic-embed-text", input=prefixed_doc)
    embedding = response["embeddings"][0]
    collection.add(
        ids=[str(i)],
        embeddings=[embedding],
        documents=[doc]
    )
print("Embeddings generated and stored in Chroma!")
Run it:
python generate_embeddings.py
- Output: "Embeddings generated and stored in Chroma!"
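The script embeds each document with a search_document: prefix but stores the unprefixed text, and the query script later uses search_query:. A small helper can keep those prefixes consistent. This is only a sketch: with_task_prefix and build_records are hypothetical names, and the list of four task names is an assumption taken from the nomic-embed-text model card (the tutorial itself uses only the two search prefixes).

```python
# Task prefixes assumed from the nomic-embed-text v1.5 model card.
TASK_PREFIXES = ("search_document", "search_query", "clustering", "classification")


def with_task_prefix(text, task):
    """Prepend a task prefix to the text, rejecting unknown task names."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task prefix: {task!r}")
    return f"{task}: {text.strip()}"


def build_records(documents):
    """Return (ids, prefixed_texts) in the same order as the input documents."""
    ids = [str(i) for i, _ in enumerate(documents)]
    prefixed = [with_task_prefix(doc, "search_document") for doc in documents]
    return ids, prefixed
```

The prefixed strings are what would be passed to the embedding call, while the original documents are what would be stored for retrieval.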
Create query_embeddings.py, which filters out the <think> section:
import ollama
import chromadb
# Initialize Chroma client
client = chromadb.PersistentClient(path="./exercise_db")
# Get the existing collection
collection = client.get_collection(name="exercise_science")
# Query example
query = "What exercises improve heart health?"
prefixed_query = f"search_query: {query}"
query_response = ollama.embed(model="nomic-embed-text", input=prefixed_query)
query_embedding = query_response["embeddings"][0]
# Search for the top match
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=1
)
# Retrieve the top document
retrieved_doc = results["documents"][0][0]
# Generate answer with deepseek-r1:1.5b
prompt = f"Using this info: '{retrieved_doc}', answer: {query}"
answer = ollama.generate(model="deepseek-r1:1.5b", prompt=prompt)
# Remove <think> section from the response
response_text = answer["response"]
start_tag = "<think>"
end_tag = "</think>"
if start_tag in response_text and end_tag in response_text:
    start_idx = response_text.index(start_tag)
    end_idx = response_text.index(end_tag) + len(end_tag)
    response_text = response_text[:start_idx] + response_text[end_idx:]
response_text = response_text.strip()  # Clean up any extra whitespace
# Print results
print("\nQuery:", query)
print("Retrieved document:", retrieved_doc)
print("Generated answer:", response_text)
Run it:
python query_embeddings.py
- Expected Output (a sample response with the <think> section removed; exact wording varies between runs):
Query: What exercises improve heart health?
Retrieved document: Cardiovascular exercise, such as running, cycling, or swimming, boosts heart health by improving circulation, increasing lung capacity, and reducing the risk of chronic diseases like hypertension and diabetes.
Generated answer: The provided information highlights several cardiovascular exercises that improve heart health by enhancing circulation, lung capacity, and reducing the risk of chronic diseases like hypertension and diabetes. Here is a comprehensive list of exercises that support cardiovascular health:
1. **Running**: Enhances circulation, improves lung capacity, and contributes to overall cardiovascular function.
2. **Cycling**: Improves blood flow, increases lung capacity, and reduces the likelihood of heart disease and related conditions.
3. **Swimming**: Boosts cardiovascular efficiency, enhances lung expansion, and mitigates the risk of chronic diseases.
4. **Yoga and Tai Chi**: These exercises target multiple bodies, improving circulation, strength, flexibility, and overall health without relying solely on exercise alone.
5. **Freestyle Swimming**: Further increases heart rate and blood flow through efficient stroke technique.
6. **Push-Ups**: While primarily a strength training exercise, they can improve cardiovascular fitness by enhancing circulation.
These exercises collectively support heart health by addressing circulatory and lung-related systems.
- Reason: Chroma is lightweight, open-source (Apache 2.0), and local, storing embeddings in ./exercise_db.
- Pros: Simple, no cloud needed, integrates with Ollama.
- Alternatives: FAISS (faster for scale) or pgvector (SQL-based), but Chroma suits this setup.
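Conceptually, a vector store answers a query by ranking stored embeddings by their distance to the query embedding. The sketch below shows that idea as an exhaustive scan in plain Python. It assumes squared Euclidean distance as the metric; Chroma's actual metric is configurable and its index is approximate rather than exhaustive, so this illustrates the principle and is not Chroma's implementation. The names squared_l2, nearest, and toy_store are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query_vec, stored, n_results=1):
    """Return the ids of the n_results stored vectors closest to query_vec.

    `stored` maps id -> embedding. This is an exhaustive scan, not an index.
    """
    ranked = sorted(stored, key=lambda doc_id: squared_l2(query_vec, stored[doc_id]))
    return ranked[:n_results]


# Toy 2-dimensional "embeddings" (real ones have hundreds of dimensions).
toy_store = {"strength": [1.0, 0.0], "cardio": [0.0, 1.0], "hiit": [0.6, 0.8]}
print(nearest([0.1, 0.9], toy_store, n_results=2))  # ['cardio', 'hiit']
```

An exhaustive scan like this is fine for five documents; dedicated indexes matter once the collection grows large.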
- More Results: Set n_results=3 and combine the documents:
  retrieved_docs = " ".join(results["documents"][0])
  prompt = f"Using this info: '{retrieved_docs}', answer: {query}"
- Prompt Tuning: Adjust prompt for concise answers, e.g., "Summarize how '{retrieved_doc}' answers: {query}."
- Custom Queries: Edit query and rerun.
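When several documents are retrieved, numbering them in the prompt can make it easier for a small model to keep the passages apart. The following is one possible sketch of a prompt builder; build_prompt, its concise flag, and the exact instruction wording are hypothetical choices, not part of the original scripts.

```python
def build_prompt(query, docs, concise=True):
    """Combine retrieved documents and a question into one prompt string."""
    # Number each passage so the model can tell them apart.
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs, start=1))
    instruction = "Answer in two sentences or fewer." if concise else "Answer the question."
    return (
        f"Use only the numbered passages below.\n{context}\n\n"
        f"Question: {query}\n{instruction}"
    )
```

In query_embeddings.py this could be called as build_prompt(query, results["documents"][0]) after raising n_results.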
- Model Check: Run ollama list to confirm deepseek-r1:1.5b is available. If not, verify its name or load it manually.
- Ollama Running: Ensure ollama serve is active.
- Response Format: If <think> tags vary, adjust the filtering logic (e.g., use regex).
- Empty Collection: Add print(collection.count()). If it prints 0, rerun generate_embeddings.py.
- <think> Removal: The script assumes <think> and </think> wrap the reasoning. If the format changes, you might need a more robust parser (e.g., re.sub(r'<think>.*?</think>', '', response_text, flags=re.DOTALL) with import re).
- Deepseek Behavior: deepseek-r1:1.5b includes reasoning steps. If unwanted reasoning still appears in the output, you could tweak the prompt to discourage it (e.g., "Answer directly without reasoning: ...").
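A regex-based filter along the lines suggested above might look like the sketch below. It removes every complete <think>...</think> block, not just the first, and also drops an opening <think> that is never closed, for example when a response is cut off. Discarding everything after an unclosed tag is a design assumption made here, and strip_think is a hypothetical helper name.

```python
import re

# Matches a complete <think>...</think> block, including newlines inside it.
_CLOSED = re.compile(r"<think>.*?</think>", flags=re.DOTALL | re.IGNORECASE)
# Matches an opening tag that is never closed (e.g. a truncated response).
_UNCLOSED = re.compile(r"<think>.*\Z", flags=re.DOTALL | re.IGNORECASE)


def strip_think(text):
    """Remove reasoning blocks wrapped in <think> tags and trim whitespace."""
    text = _CLOSED.sub("", text)
    text = _UNCLOSED.sub("", text)
    return text.strip()
```

In query_embeddings.py, the tag-handling lines could then be replaced with response_text = strip_think(answer["response"]).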