ai-for-developersai-model-deploymentai-tutorialsamazon-sagemakeraws-ai

Hands-On Guide: Embeddings, LLMs & RAG with SageMaker Studio

2025-03-31Watch on YouTube ↗

Watch the full video on YouTube( Note: To complete this lab, you need two instances of ml.g5.2xlarge which is not

Use with AI

ShareX LinkedIn

This lab costs about $10 in AWS compute - two ml.g5.2xlarge instances running for two hours. It is not free tier eligible. Go in knowing that upfront so you're not surprised when the bill arrives.

What you get for that $10: a working RAG pipeline on AWS, hands-on experience deploying real embedding and LLM endpoints via SageMaker JumpStart, and a notebook you can adapt for your own documents.

What You'll Build

A retrieval-augmented generation system using:

Embedding model: BGE Small En V1.5 (deployed via SageMaker JumpStart)
LLM: Mistral 7B Instruct (deployed via SageMaker JumpStart)
Vector store: FAISS (in-memory, runs in the notebook)

The RAG flow: embed your documents → store vectors in FAISS → at query time, embed the question → retrieve top-N relevant chunks → build a context prompt → call the Mistral endpoint for a grounded answer.

Lab Structure

Step 1 - Deploy the BGE Embedding Model

Open SageMaker Studio and navigate to JumpStart. Search for "BGE Small En V1.5" and deploy it to an ml.g5.2xlarge instance. This model converts text into dense vector representations you can compare for similarity.

Wait for the endpoint status to show "InService" before moving on - deployment takes a few minutes.

Step 2 - Deploy Mistral 7B Instruct

Back in JumpStart, search for "Mistral 7B Instruct" and deploy it to a second ml.g5.2xlarge instance. This is the generation model - it takes a context-stuffed prompt and returns a grounded answer.

Two endpoints, two instances, two separate costs ticking from this point.

Step 3 - Clone the Workshop Repo and Run the RAG Notebook

Clone the workshop repository into SageMaker Studio:

git clone https://github.com/abhisodhani/sagemaker-workshop-cloud-seminar.git

Open the RAG notebook. It walks through two sub-steps:

Step 3a - Index documents into FAISS

The notebook loads sample documents, calls your BGE endpoint to embed each chunk, and stores the vectors in a FAISS index. FAISS runs entirely in memory - no external database needed for this lab.

Step 3b - Retrieve, prompt, and generate

At query time:

The question gets embedded via the BGE endpoint
FAISS returns the top-N most similar document chunks
Those chunks are injected into a prompt template
The prompt goes to Mistral, which generates a grounded answer referencing only the retrieved context

This is the core RAG loop. Understanding it here makes every hosted RAG service (Bedrock Knowledge Bases, Azure AI Search, etc.) easier to reason about - they're all variations on this same pattern.

Code and Resources

My annotated version of the lab code is on GitHub:

github.com/pagand/upaspro - aws_sagemaker_rag

Original workshop repo:

github.com/abhisodhani/sagemaker-workshop-cloud-seminar

Cost Control

Stop both endpoints immediately after the lab. SageMaker endpoints bill by the hour even when idle. Go to Inference → Endpoints in the SageMaker console, select each endpoint, and delete it. Don't just close the notebook.

If you want to experiment further without burning money, consider SageMaker's serverless inference for the embedding model - it scales to zero between calls, which is fine for low-throughput experimentation.

Want to go deeper?

I work with SaaS companies, real-estate, finance, and regulated-industry teams on AI adoption. Book a 20-minute strategy call - no pitch, just a focused conversation about your situation.

Book a strategy call →Download the checklist →

I make videos like this when I have something worth explaining. Join AI Command Room and I'll let you know when the next one ships.