diff --git a/evaluation-observe/bedrock-llm-as-judge-evaluation/model-as-a-judge.ipynb b/evaluation-observe/bedrock-llm-as-judge-evaluation/model-as-a-judge.ipynb new file mode 100644 index 00000000..929ba10c --- /dev/null +++ b/evaluation-observe/bedrock-llm-as-judge-evaluation/model-as-a-judge.ipynb @@ -0,0 +1,1203 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Amazon Bedrock Model-as-a-Judge Evaluation Guide\n", + "\n", + "## Introduction\n", + "\n", + "This notebook demonstrates how to use Amazon Bedrock's Model-as-a-Judge feature for systematic model evaluation. The Model-as-a-Judge approach uses a foundation model to score another model's responses and provide explanations for the scores. The guide covers creating evaluation datasets, running evaluations, and comparing different foundation models.\n", + "\n", + "## Contents\n", + "\n", + "1. [Setup and Configuration](#setup)\n", + "2. [Dataset Generation](#dataset)\n", + "3. [S3 Integration](#s3)\n", + "4. [Single Model Evaluation](#single)\n", + "5. [Model Selection and Comparison](#comparison)\n", + "6. 
[Monitoring and Results](#monitoring)\n", + "\n", + "## Prerequisites\n", + "\n", + "- An AWS account with Bedrock access\n", + "- An IAM service role (`ROLE_ARN` below) with the permissions the evaluation job needs\n", + "- Access to supported evaluator models (Claude 3 Haiku, Claude 3.5 Sonnet, Mistral Large, or Meta Llama 3.1)\n", + "- An S3 bucket for storing evaluation data\n", + "\n", + "Let's begin by upgrading boto3 to the latest version." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%pip install boto3 --upgrade" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Environment Setup " + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import boto3\n", + "import json\n", + "import random\n", + "from datetime import datetime\n", + "from typing import List, Dict, Any, Optional\n", + "\n", + "# AWS Configuration: fill in these placeholders before running\n", + "REGION = \"\"\n", + "ROLE_ARN = \"arn:aws:iam:::role/\"\n", + "BUCKET_NAME = \"\"\n", + "PREFIX = \"\"\n", + "dataset_custom_name = \"dummy-data\"\n", + "\n", + "# Initialize AWS clients\n", + "bedrock_client = boto3.client('bedrock', region_name=REGION)\n", + "s3_client = boto3.client('s3', region_name=REGION)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dataset Generation \n", + "\n", + "We'll create a simple dataset of shopping-themed math word problems. These problems test:\n", + "- Basic arithmetic\n", + "- Logical reasoning\n", + "- Natural language understanding\n", + "\n", + "The dataset follows the required JSONL format for Bedrock evaluation jobs." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import random\n", + "import json\n", + "\n", + "def generate_shopping_problems(num_problems=50):\n", + " \"\"\"Generate shopping-related math problems with random values.\"\"\"\n", + " problems = []\n", + " items = [\"apples\", \"oranges\", \"bananas\", \"books\", \"pencils\", \"notebooks\"]\n", + " \n", + " for _ in range(num_problems):\n", + " # Generate random values\n", + " item = random.choice(items)\n", + " quantity = random.randint(3, 20)\n", + " price_per_item = round(random.uniform(1.5, 15.0), 2)\n", + " discount_percent = random.choice([10, 15, 20, 25, 30])\n", + " \n", + " # Calculate the answer\n", + " total_price = quantity * price_per_item\n", + " discount_amount = total_price * (discount_percent / 100)\n", + " final_price = round(total_price - discount_amount, 2)\n", + " \n", + " # Create the problem; amounts are formatted to two decimals so that\n", + " # float artifacts and stray backslashes do not end up in the text\n", + " problem = {\n", + " \"prompt\": f\"If {item} cost ${price_per_item:.2f} each and you buy {quantity} of them with a {discount_percent}% discount, how much will you pay in total?\",\n", + " \"category\": \"Shopping Math\",\n", + " \"referenceResponse\": f\"The total price will be ${final_price:.2f}. 
Original price: ${total_price:.2f} minus {discount_percent}% discount (${discount_amount:.2f})\"\n", + " }\n", + " \n", + " problems.append(problem)\n", + " \n", + " return problems\n", + "\n", + "def save_to_jsonl(problems, output_file):\n", + " \"\"\"Save the problems to a JSONL file.\"\"\"\n", + " with open(output_file, 'w') as f:\n", + " for problem in problems:\n", + " f.write(json.dumps(problem) + '\\n')\n", + "\n", + "SAMPLE_SIZE = 30\n", + "problems = generate_shopping_problems(SAMPLE_SIZE)\n", + "save_to_jsonl(problems, f\"{dataset_custom_name}.jsonl\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## S3 Integration \n", + "\n", + "After generating our sample dataset, we need to upload it to S3 for use in the evaluation job. \n", + "We'll use the boto3 S3 client to upload our JSONL file.\n", + "\n", + "> **Note**: Make sure your IAM role has appropriate S3 permissions (s3:PutObject) for the target bucket." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "def upload_to_s3(local_file: str, bucket: str, s3_key: str) -> bool:\n", + " \"\"\"\n", + " Upload a file to S3 with error handling.\n", + " \n", + " Returns:\n", + " bool: Success status\n", + " \"\"\"\n", + " try:\n", + " s3_client.upload_file(local_file, bucket, s3_key)\n", + " print(f\"✓ Successfully uploaded to s3://{bucket}/{s3_key}\")\n", + " return True\n", + " except Exception as e:\n", + " print(f\"✗ Error uploading to S3: {str(e)}\")\n", + " return False\n", + "\n", + "# Upload dataset\n", + "s3_key = f\"{PREFIX}/{dataset_custom_name}.jsonl\"\n", + "upload_success = upload_to_s3(f\"{dataset_custom_name}.jsonl\", BUCKET_NAME, s3_key)\n", + "\n", + "if not upload_success:\n", + " raise Exception(\"Failed to upload dataset to S3\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Evaluation Job Configuration\n", + "\n", + "Configure the LLM-as-Judge evaluation 
with comprehensive metrics for assessing model performance:\n", + "\n", + "| Metric Category | Description |\n", + "|----------------|-------------|\n", + "| Quality | Correctness, Completeness, Faithfulness |\n", + "| User Experience | Helpfulness, Coherence, Relevance |\n", + "| Instructions | Following Instructions, Professional Style |\n", + "| Safety | Harmfulness, Stereotyping, Refusal |" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "def create_llm_judge_evaluation(\n", + " client,\n", + " job_name: str,\n", + " role_arn: str,\n", + " input_s3_uri: str,\n", + " output_s3_uri: str,\n", + " evaluator_model_id: str,\n", + " generator_model_id: str,\n", + " dataset_name: str = None,\n", + " task_type: str = \"General\" # must be General for LLMaaJ\n", + "): \n", + " # All available LLM-as-judge metrics\n", + " llm_judge_metrics = [\n", + " \"Builtin.Correctness\",\n", + " \"Builtin.Completeness\", \n", + " \"Builtin.Faithfulness\",\n", + " \"Builtin.Helpfulness\",\n", + " \"Builtin.Coherence\",\n", + " \"Builtin.Relevance\",\n", + " \"Builtin.FollowingInstructions\",\n", + " \"Builtin.ProfessionalStyleAndTone\",\n", + " \"Builtin.Harmfulness\",\n", + " \"Builtin.Stereotyping\",\n", + " \"Builtin.Refusal\"\n", + " ]\n", + "\n", + " # Configure dataset\n", + " dataset_config = {\n", + " \"name\": dataset_name or \"CustomDataset\",\n", + " \"datasetLocation\": {\n", + " \"s3Uri\": input_s3_uri\n", + " }\n", + " }\n", + "\n", + " try:\n", + " response = client.create_evaluation_job(\n", + " jobName=job_name,\n", + " roleArn=role_arn,\n", + " applicationType=\"ModelEvaluation\",\n", + " evaluationConfig={\n", + " \"automated\": {\n", + " \"datasetMetricConfigs\": [\n", + " {\n", + " \"taskType\": task_type,\n", + " \"dataset\": dataset_config,\n", + " \"metricNames\": llm_judge_metrics\n", + " }\n", + " ],\n", + " \"evaluatorModelConfig\": {\n", + " \"bedrockEvaluatorModels\": [\n", + " 
{\n", + " \"modelIdentifier\": evaluator_model_id\n", + " }\n", + " ]\n", + " }\n", + " }\n", + " },\n", + " inferenceConfig={\n", + " \"models\": [\n", + " {\n", + " \"bedrockModel\": {\n", + " \"modelIdentifier\": generator_model_id\n", + " }\n", + " }\n", + " ]\n", + " },\n", + " outputDataConfig={\n", + " \"s3Uri\": output_s3_uri\n", + " }\n", + " )\n", + " return response\n", + " \n", + " except Exception as e:\n", + " print(f\"Error creating evaluation job: {str(e)}\")\n", + " raise" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Single Model Evaluation \n", + "\n", + "First, let's run a single evaluation job using Claude 3 Haiku as both generator and evaluator." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Job Configuration\n", + "evaluator_model = \"anthropic.claude-3-haiku-20240307-v1:0\"\n", + "generator_model = \"anthropic.claude-3-haiku-20240307-v1:0\"\n", + "job_name = f\"llmaaj-{generator_model.split('.')[0]}-{evaluator_model.split('.')[0]}-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}\"\n", + "\n", + "# S3 Paths\n", + "input_data = f\"s3://{BUCKET_NAME}/{PREFIX}/{dataset_custom_name}.jsonl\"\n", + "output_path = f\"s3://{BUCKET_NAME}/{PREFIX}\"\n", + "\n", + "# Create evaluation job\n", + "try:\n", + " llm_as_judge_response = create_llm_judge_evaluation(\n", + " client=bedrock_client,\n", + " job_name=job_name,\n", + " role_arn=ROLE_ARN,\n", + " input_s3_uri=input_data,\n", + " output_s3_uri=output_path,\n", + " evaluator_model_id=evaluator_model,\n", + " generator_model_id=generator_model,\n", + " task_type=\"General\"\n", + " )\n", + " print(f\"✓ Created evaluation job: {llm_as_judge_response['jobArn']}\")\n", + "except Exception as e:\n", + " print(f\"✗ Failed to create evaluation job: {str(e)}\")\n", + " raise" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Monitoring Job Progress\n", + "Track the 
status of your evaluation job:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Get job ARN based on job type\n", + "evaluation_job_arn = llm_as_judge_response['jobArn']\n", + "\n", + "# Check job status\n", + "check_status = bedrock_client.get_evaluation_job(jobIdentifier=evaluation_job_arn) \n", + "print(f\"Job Status: {check_status['status']}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Model Selection and Comparison \n", + "\n", + "Now, let's evaluate multiple generator models to find the optimal model for our use case. We'll compare different foundation models while using a consistent evaluator." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Available Generator Models\n", + "GENERATOR_MODELS = [\n", + " \"anthropic.claude-3-haiku-20240307-v1:0\",\n", + " \"amazon.nova-micro-v1:0\"\n", + "]\n", + "\n", + "# Consistent Evaluator\n", + "EVALUATOR_MODEL = \"anthropic.claude-3-haiku-20240307-v1:0\"\n", + "\n", + "def run_model_comparison(\n", + " generator_models: List[str],\n", + " evaluator_model: str\n", + ") -> List[Dict[str, Any]]:\n", + " evaluation_jobs = []\n", + " \n", + " for generator_model in generator_models:\n", + " job_name = f\"llmaaj-{generator_model.split('.')[0]}-{evaluator_model.split('.')[0]}-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}\"\n", + " \n", + " try:\n", + " response = create_llm_judge_evaluation(\n", + " client=bedrock_client,\n", + " job_name=job_name,\n", + " role_arn=ROLE_ARN,\n", + " input_s3_uri=input_data,\n", + " output_s3_uri=f\"{output_path}/{job_name}/\",\n", + " evaluator_model_id=evaluator_model,\n", + " generator_model_id=generator_model,\n", + " task_type=\"General\"\n", + " )\n", + " \n", + " job_info = {\n", + " \"job_name\": job_name,\n", + " \"job_arn\": response[\"jobArn\"],\n", + " \"generator_model\": 
generator_model,\n", + " \"evaluator_model\": evaluator_model,\n", + " \"status\": \"CREATED\"\n", + " }\n", + " evaluation_jobs.append(job_info)\n", + " \n", + " print(f\"✓ Created job: {job_name}\")\n", + " print(f\" Generator: {generator_model}\")\n", + " print(f\" Evaluator: {evaluator_model}\")\n", + " print(\"-\" * 80)\n", + " \n", + " except Exception as e:\n", + " print(f\"✗ Error with {generator_model}: {str(e)}\")\n", + " continue\n", + " \n", + " return evaluation_jobs\n", + "\n", + "# Run model comparison\n", + "evaluation_jobs = run_model_comparison(GENERATOR_MODELS, EVALUATOR_MODEL)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Monitoring and Results \n", + "\n", + "Track the progress of all evaluation jobs and display their current status." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# function to check job status\n", + "def check_jobs_status(jobs, client):\n", + " \"\"\"Check and update status for all evaluation jobs\"\"\"\n", + " for job in jobs:\n", + " try:\n", + " response = client.get_evaluation_job(\n", + " jobIdentifier=job[\"job_arn\"]\n", + " )\n", + " job[\"status\"] = response[\"status\"]\n", + " except Exception as e:\n", + " job[\"status\"] = f\"ERROR: {str(e)}\"\n", + " \n", + " return jobs\n", + "\n", + "# Check initial status\n", + "updated_jobs = check_jobs_status(evaluation_jobs, bedrock_client)\n", + "\n", + "# Display status summary\n", + "for job in updated_jobs:\n", + " print(f\"Job: {job['job_name']}\")\n", + " print(f\"Status: {job['status']}\")\n", + " print(f\"Generator: {job['generator_model']}\")\n", + " print(f\"Evaluator: {job['evaluator_model']}\")\n", + " print(\"-\" * 80)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Spearman's Correlation Analysis Between Multiple Generator Models\n", + "\n", + "* To calculate the Spearman's rank correlation between generator 
models, first read the evaluation results from S3 using the path structure:\n", + "```s3://[output-path]/[job-name]/[job-uuid]/models/[model-id]/taskTypes/[task-type]/datasets/dataset/[file-uuid]_output.jsonl```\n", + "- Each file contains one record per prompt with the judge's scores for the metrics requested in the job (for example Correctness, Completeness, Helpfulness, Coherence, and Faithfulness).\n", + "\n", + "* Use `scipy.stats.spearmanr` to compute, for each metric, the correlation between the two generator models' scores. The coefficient is undefined when either model's scores are constant, so those cases are reported instead of computed. \n", + "\n", + "* The resulting per-metric coefficients show whether the judge ranks the same prompts similarly for both models. Coefficients closer to 1.0 indicate stronger agreement in how the two models' responses were scored across prompts; they do not indicate which model scored higher." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import boto3\n", + "import numpy as np\n", + "from scipy import stats\n", + "\n", + "def read_and_organize_metrics_from_s3(bucket_name, file_key):\n", + " \"\"\"Read an evaluation output JSONL file from S3 and return {metric name: [scores]}, or None on error.\"\"\"\n", + " s3_client = boto3.client('s3')\n", + " metrics_dict = {}\n", + " \n", + " try:\n", + " response = s3_client.get_object(Bucket=bucket_name, Key=file_key)\n", + " content = response['Body'].read().decode('utf-8')\n", + " \n", + " for line in content.strip().split('\\n'):\n", + " if line:\n", + " data = json.loads(line)\n", + " if 'automatedEvaluationResult' in data and 'scores' in data['automatedEvaluationResult']:\n", + " for score in data['automatedEvaluationResult']['scores']:\n", + " metric_name = score['metricName']\n", + " if 'result' in score:\n", + " metric_value = score['result']\n", + " if metric_name not in metrics_dict:\n", + " metrics_dict[metric_name] = []\n", + " metrics_dict[metric_name].append(metric_value)\n", + " return metrics_dict\n", + " \n", + " except Exception as e:\n", + " print(f\"Error: {e}\")\n", + " return None\n", + "\n", + "def get_spearmanr_correlation(scores1, scores2):\n", + " if 
len(set(scores1)) == 1 or len(set(scores2)) == 1:\n", + " return \"undefined (constant scores)\", \"undefined\"\n", + " \n", + " try:\n", + " result = stats.spearmanr(scores1, scores2)\n", + " return round(float(result.statistic), 4), round(float(result.pvalue), 4)\n", + " except Exception as e:\n", + " return f\"error: {str(e)}\", \"undefined\"\n", + "\n", + "# Extract metrics\n", + "bucket_name = \"\"\n", + "file_key1 = \"\"\n", + "file_key2 = \"\"\n", + "\n", + "metrics1 = read_and_organize_metrics_from_s3(bucket_name, file_key1)\n", + "metrics2 = read_and_organize_metrics_from_s3(bucket_name, file_key2)\n", + "\n", + "# Calculate correlations for common metrics\n", + "common_metrics = set(metrics1.keys()) & set(metrics2.keys())\n", + "\n", + "for metric_name in common_metrics:\n", + " scores1 = metrics1[metric_name]\n", + " scores2 = metrics2[metric_name]\n", + " \n", + " if len(scores1) == len(scores2):\n", + " correlation, p_value = get_spearmanr_correlation(scores1, scores2)\n", + " \n", + " print(f\"\\nMetric: {metric_name}\")\n", + " print(f\"Number of samples: {len(scores1)}\")\n", + " print(f\"Unique values in Model 1 scores: {len(set(scores1))}\")\n", + " print(f\"Unique values in Model 2 scores: {len(set(scores2))}\")\n", + " print(f\"Model 1 scores range: [{min(scores1)}, {max(scores1)}]\")\n", + " print(f\"Model 2 scores range: [{min(scores2)}, {max(scores2)}]\")\n", + " print(f\"Spearman correlation coefficient: {correlation}\")\n", + " print(f\"P-value: {p_value}\")\n", + " else:\n", + " print(f\"\\nMetric: {metric_name}\")\n", + " print(\"Error: Different number of samples between models\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Next Steps\n", + "\n", + "After running the evaluation job:\n", + "1. Monitor the job status in the Bedrock console or through `get_evaluation_job` API\n", + "2. 
Review the report card for:\n", + " - Score distributions across different metrics\n", + " - Detailed explanations for scoring provided by the judge model\n", + " - Overall performance analysis\n", + "3. Access full results in your specified S3 bucket\n", + "\n", + "> **Note**: The evaluation results will help you understand your model's strengths and areas for improvement across multiple dimensions of performance." + ] + } + ], + "metadata": { + "instance_type": "ml.t3.medium", + "kernelspec": { + "display_name": "base", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/evaluation-observe/bedrock-rag-evaluation/dummy-dataset-preparation.ipynb b/evaluation-observe/bedrock-rag-evaluation/dummy-dataset-preparation.ipynb new file mode 100644 index 00000000..dc1e2dc9 --- /dev/null +++ b/evaluation-observe/bedrock-rag-evaluation/dummy-dataset-preparation.ipynb @@ -0,0 +1,1161 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Synthetic Q&A Dataset Generator for RAG Evaluation\n", + "\n", + "This script automates the generation of synthetic question-answer pairs from PDF documents for evaluating Retrieval-Augmented Generation (RAG) systems. 
It uses LangChain with the Meta Llama 3 70B Instruct model on Amazon Bedrock to:\n", + "- Extract meaningful chunks from PDF documents\n", + "- Generate relevant questions based on the content\n", + "- Create corresponding answers and identify source contexts\n", + "- Output the data in two formats: prompt-only and prompt-with-ground-truth\n", + "- Perform quality checks to ensure valid content\n", + "\n", + "## Prerequisites\n", + "- Amazon Bedrock access with the Meta Llama 3 70B Instruct model enabled\n", + "- Python 3.8+\n", + "- PDF documents in a specified directory\n", + "- Required packages: langchain, langchain-aws, langchain-community, pypdf, boto3, pandas, tqdm\n", + "\n", + "## Import Required Libraries\n", + "\n", + "These libraries handle PDF processing, AWS integration, data manipulation, and progress tracking." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "#%pip install langchain langchain-aws langchain-community pypdf --quiet" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Add S3 Bucket\n", + "\n", + "Before we proceed, set the name of an S3 bucket that has `CORS` enabled and that you have permission to use. The dummy dataset will be uploaded to this bucket, and the evaluation job will read it from there.\n", + "\n", + "Check the `CORS` requirements on our [documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-security-cors.html) page." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "bucket_name = \"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "\n", + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "from langchain.document_loaders.pdf import PyPDFDirectoryLoader\n", + "import json\n", + "import boto3\n", + "from langchain_community.chat_models import BedrockChat\n", + "from langchain.prompts import PromptTemplate\n", + "import pandas as pd\n", + "from tqdm import tqdm\n", + "import os\n", + "import shutil" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load and Process PDF Documents\n", + "\n", + "This section loads PDF documents and splits them into manageable chunks for processing. The RecursiveCharacterTextSplitter ensures context-aware splitting with overlap to maintain coherence." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "working_dir = os.getcwd()\n", + "\n", + "print(\"Current Working Directory:\", working_dir)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "loader = PyPDFDirectoryLoader(f\"{working_dir}/synthetic_data\") \n", + "documents = loader.load()\n", + "\n", + "text_splitter = RecursiveCharacterTextSplitter(\n", + " chunk_size = 2500, \n", + " chunk_overlap = 100,\n", + " separators=[\"\\n\\n\", \"\\n\", \".\", \" \", \"\"],\n", + ")\n", + "\n", + "docs = text_splitter.split_documents(documents)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "len(docs)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Configure Amazon Bedrock\n", + "\n", + "Sets up the connection to Amazon Bedrock and configures the Llama 3 70B Instruct model (`meta.llama3-70b-instruct-v1:0`) with the inference parameters used for generation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "boto3_bedrock = boto3.client('bedrock-runtime',region_name='us-east-1')\n", + "llama_3_70B = \"meta.llama3-70b-instruct-v1:0\"\n", + "inference_modifier_llama = {\n", + " \"max_gen_len\": 4096,\n", + " \"temperature\": 0.5,\n", + "}\n", + "\n", + "llm = BedrockChat(\n", + " model_id = llama_3_70B,\n", + " client = boto3_bedrock, \n", + " model_kwargs = inference_modifier_llama \n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define Prompt Templates\n", + "\n", + "These templates guide the LLM in generating questions, answers, and identifying relevant context. 
Each template is carefully structured to ensure:\n", + "- Questions are meaningful and answerable\n", + "- Answers are precise and based on context\n", + "- Source contexts are accurately extracted" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "initial_question_prompt_template = PromptTemplate(\n", + " input_variables=[\"context\"],\n", + " template=\"\"\"\n", + " [INST]\n", + " \n", + " Here is some context:\n", + " \n", + " {context}\n", + " \n", + "\n", + " Your task is to generate 1 question that can be answered using the provided context, following these rules:\n", + "\n", + " \n", + " 1. The question should make sense to humans even when read without the given context.\n", + " 2. The question should be fully answered from the given context.\n", + " 3. The question should be framed from a part of context that contains important information. It can also be from tables, code, etc.\n", + " 4. The answer to the question should not contain any links.\n", + " 5. The question should be of moderate difficulty.\n", + " 6. The question must be reasonable and must be understood and responded by humans.\n", + " 7. Do not use phrases like 'provided context', etc. in the question.\n", + " 8. Avoid framing questions using the word \"and\" that can be decomposed into more than one question.\n", + " 9. 
The question should not contain more than 10 words, make use of abbreviations wherever possible.\n", + " \n", + "\n", + " Output only the generated question with a \"?\" at the end, no other text or characters.\n", + " \n", + " [/INST]\n", + " \"\"\")\n", + "\n", + "answer_prompt_template = PromptTemplate(\n", + " input_variables=[\"context\", \"question\"],\n", + " template=\"\"\"\n", + " [INST]\n", + " \n", + " \n", + " You are an experienced QA Engineer for building large language model applications.\n", + " It is your task to generate an answer to the following question {question} only based on the {context}\n", + " The output should be only the answer generated from the context.\n", + "\n", + " \n", + " 1. Only use the given context as a source for generating the answer.\n", + " 2. Be as precise as possible with answering the question.\n", + " 3. Be concise in answering the question and only answer the question at hand rather than adding extra information.\n", + " \n", + "\n", + " Only output the generated answer as a sentence. No extra characters.\n", + " \n", + " \n", + " [/INST]\n", + " Assistant:\n", + " \"\"\")\n", + "\n", + "source_prompt_template = PromptTemplate(\n", + " input_variables=[\"context\", \"question\"],\n", + " template=\"\"\"Human:\n", + " [INST]\n", + " \n", + " Here is the context:\n", + " \n", + " {context}\n", + " \n", + "\n", + " Your task is to extract the relevant sentences from the given context that can potentially help answer the following question. 
You are not allowed to make any changes to the sentences from the context.\n", + "\n", + " \n", + " {question}\n", + " \n", + "\n", + " Output only the relevant sentences you found, one sentence per line, without any extra characters or explanations.\n", + " \n", + " [/INST]\n", + " Assistant:\n", + " \"\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define Helper Functions\n", + "\n", + "These core functions handle the interaction with the LLM to generate questions, answers, and extract relevant source contexts." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "def generate_question(doc, llm):\n", + " initial_question_prompt = initial_question_prompt_template.format(context=doc)\n", + " initial_question = llm.invoke(initial_question_prompt)\n", + " return initial_question\n", + "\n", + "def generate_answer(question: str, doc, llm):\n", + " answer_prompt = answer_prompt_template.format(question = question, context=doc)\n", + " answer = llm.invoke(answer_prompt)\n", + " return answer\n", + "\n", + "def generate_source(question: str, doc, llm):\n", + " source_prompt = source_prompt_template.format(question = question, context=doc)\n", + " source = llm.invoke(source_prompt)\n", + " return source" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define Dataset Generation Functions\n", + "\n", + "These functions orchestrate the QA pair generation process, managing the creation and storage of questions, answers, and contexts in a structured format." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "def generate_qa_dataset_doc(doc, llm, dataset, doc_number):\n", + " # llm.invoke returns a message object, so pass its text (.content) to the follow-up prompts\n", + " question = generate_question(doc, llm)\n", + " dataset.at[doc_number, \"question\"] = question.content\n", + " \n", + " answer = generate_answer(question.content, doc, llm)\n", + " dataset.at[doc_number, \"reference_answer\"] = answer.content\n", + " \n", + " source_sentence = generate_source(question.content, doc, llm)\n", + " dataset.at[doc_number, \"source_sentence\"] = source_sentence.content\n", + " \n", + " dataset.at[doc_number, \"source_raw\"] = doc.page_content\n", + " dataset.at[doc_number, \"source_document\"] = doc.metadata[\"source\"]\n", + " \n", + " return dataset\n", + "\n", + "def generate_dataset(documents, llm, dataset):\n", + " for doc in tqdm(range(len(documents))):\n", + " dataset = generate_qa_dataset_doc(doc = documents[doc], llm = llm, dataset = dataset, doc_number = doc)\n", + " return dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define Schema Conversion Functions\n", + "\n", + "These functions handle data validation and conversion into two specific JSON schemas:\n", + "- prompt_only: Contains just the question for evaluation\n", + "- prompt_with_gt: Contains question, reference answer, and contexts\n", + "The functions include quality checks to ensure no empty or invalid content makes it to the final output." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "def is_valid_content(text):\n", + " return bool(text and text.strip())\n", + "\n", + "def convert_schema(example, schema_type=\"prompt_only\"):\n", + " if not is_valid_content(example[\"query\"]):\n", + " return None\n", + " \n", + " query = example[\"query\"].strip()\n", + "\n", + " if schema_type == \"prompt_only\":\n", + " new_schema = {\n", + " \"conversationTurns\": [\n", + " {\n", + " \"prompt\": {\n", + " \"content\": [{\"text\": query}]\n", + " }\n", + " }\n", + " ]\n", + " }\n", + " elif schema_type == \"prompt_with_gt\":\n", + " reference_answer = example[\"reference_answer\"].strip()\n", + " if not (is_valid_content(reference_answer) and example[\"reference_contexts\"]):\n", + " return None\n", + " \n", + " valid_contexts = [\n", + " context.strip() for context in example[\"reference_contexts\"] \n", + " if is_valid_content(context)\n", + " ]\n", + " \n", + " if not valid_contexts:\n", + " return None\n", + "\n", + " new_schema = {\n", + " \"conversationTurns\": [\n", + " {\n", + " \"prompt\": {\n", + " \"content\": [{\"text\": query}]\n", + " },\n", + " \"referenceResponses\": [\n", + " {\"content\": [{\"text\": reference_answer}]}\n", + " ],\n", + " \"referenceContexts\": [\n", + " {\"content\": [{\"text\": context}]} for context in valid_contexts\n", + " ]\n", + " }\n", + " ]\n", + " }\n", + " else:\n", + " raise ValueError(f\"Invalid schema_type: {schema_type}. 
Must be either 'prompt_only' or 'prompt_with_gt'\")\n", + " return new_schema\n", + "\n", + "def save_to_jsonl(df, output_file_prefix, schema_type):\n", + " valid_records = 0\n", + " skipped_records = 0\n", + " \n", + " with open(f'{output_file_prefix}_{schema_type}.jsonl', 'w') as file:\n", + " for _, row in df.iterrows():\n", + " example = {\n", + " \"query\": row[\"query\"],\n", + " \"query_by\": {\"model_name\": row[\"model_name\"], \"type\": row[\"type\"]},\n", + " # Keep each chunk as one reference context; splitting on \", \" would break the text at every comma\n", + " \"reference_contexts\": [row[\"reference_contexts\"]],\n", + " \"reference_answer\": row[\"reference_answer\"],\n", + " \"reference_answer_by\": {\"model_name\": row[\"model_name\"], \"type\": row[\"type\"]}\n", + " }\n", + " \n", + " schema = convert_schema(example, schema_type)\n", + " if schema:\n", + " json.dump(schema, file)\n", + " file.write('\\n')\n", + " valid_records += 1\n", + " else:\n", + " skipped_records += 1\n", + " \n", + " print(f\"Schema type: {schema_type}\")\n", + " print(f\"Valid records written: {valid_records}\")\n", + " print(f\"Skipped records: {skipped_records}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Generate Dataset\n", + "\n", + "Initializes the dataset generation process with a subset of documents for testing or full processing." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "docs_subset = docs[:20]\n", + "dataset = pd.DataFrame(columns=[\"question\", \"reference_answer\", \"source_sentence\",\"source_raw\",\"source_document\"])\n", + "dataset_df = generate_dataset(docs_subset, llm, dataset)\n", + "dataset_df['reference_answer'] = dataset_df['reference_answer'].str.replace(r'\\[\\/INST\\]', '', regex=True)\n", + "dataset_df['source_raw'] = dataset_df['source_raw'].str.replace(r'\\[\\/INST\\]', '', regex=True)\n", + "\n", + "filtered_df = dataset_df.drop([\"source_sentence\", \"source_document\"], axis=1)\n", + "filtered_df = filtered_df.rename(columns={\n", + " 'question': 'query',\n", + " 'reference_answer': 'reference_answer',\n", + " 'source_raw': 'reference_contexts'\n", + "})\n", + "\n", + "filtered_df[\"model_name\"] = \"llama_3_70B\"\n", + "filtered_df[\"type\"] = \"ai\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Save Dataset Files\n", + "\n", + "Creates the final JSONL files in both formats and organizes them in an evaluation_data directory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "save_to_jsonl(filtered_df, 'rag_dataset', 'prompt_only')\n", + "save_to_jsonl(filtered_df, 'rag_dataset', 'prompt_with_gt')\n", + "\n", + "if not os.path.exists(\"evaluation_data\"):\n", + " os.mkdir(\"evaluation_data\")\n", + "\n", + "for file in ['rag_dataset_prompt_only.jsonl', 'rag_dataset_prompt_with_gt.jsonl']:\n", + " shutil.move(file, 'evaluation_data/')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Upload to S3 (Optional)\n", + "\n", + "Optional functionality to upload the generated datasets to Amazon S3 for further use." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "s3_client = boto3.client('s3', region_name='us-east-1')\n", + "\n", + "for file in ['rag_dataset_prompt_only.jsonl', 'rag_dataset_prompt_with_gt.jsonl']:\n", + " s3_client.upload_file(f'evaluation_data/{file}', bucket_name, f'evaluation_data/{file}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The script generates two evaluation datasets in JSONL format, stored in the `evaluation_data` directory:\n", + "1. `rag_dataset_prompt_only.jsonl`: Contains only questions for basic evaluation\n", + "2. `rag_dataset_prompt_with_gt.jsonl`: Contains questions, reference answers, and contexts for comprehensive evaluation\n", + "\n", + "These datasets can be used to evaluate RAG systems by comparing their responses against the generated reference answers and contexts." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# End" + ] + } + ], + "metadata": { + "availableInstances": [ + { + "_defaultOrder": 0, + "_isFastLaunch": true, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 4, + "name": "ml.t3.medium", + "vcpuNum": 2 + }, + { + "_defaultOrder": 1, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 8, + "name": "ml.t3.large", + "vcpuNum": 2 + }, + { + "_defaultOrder": 2, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 16, + "name": "ml.t3.xlarge", + "vcpuNum": 4 + }, + { + "_defaultOrder": 3, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.t3.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 4, + "_isFastLaunch": true, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 8, + "name": "ml.m5.large", + 
"vcpuNum": 2 + }, + { + "_defaultOrder": 5, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 16, + "name": "ml.m5.xlarge", + "vcpuNum": 4 + }, + { + "_defaultOrder": 6, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.m5.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 7, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 64, + "name": "ml.m5.4xlarge", + "vcpuNum": 16 + }, + { + "_defaultOrder": 8, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 128, + "name": "ml.m5.8xlarge", + "vcpuNum": 32 + }, + { + "_defaultOrder": 9, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 192, + "name": "ml.m5.12xlarge", + "vcpuNum": 48 + }, + { + "_defaultOrder": 10, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 256, + "name": "ml.m5.16xlarge", + "vcpuNum": 64 + }, + { + "_defaultOrder": 11, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 384, + "name": "ml.m5.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 12, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 8, + "name": "ml.m5d.large", + "vcpuNum": 2 + }, + { + "_defaultOrder": 13, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 16, + "name": "ml.m5d.xlarge", + "vcpuNum": 4 + }, + { + "_defaultOrder": 14, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.m5d.2xlarge", + "vcpuNum": 8 + }, + { + 
"_defaultOrder": 15, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 64, + "name": "ml.m5d.4xlarge", + "vcpuNum": 16 + }, + { + "_defaultOrder": 16, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 128, + "name": "ml.m5d.8xlarge", + "vcpuNum": 32 + }, + { + "_defaultOrder": 17, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 192, + "name": "ml.m5d.12xlarge", + "vcpuNum": 48 + }, + { + "_defaultOrder": 18, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 256, + "name": "ml.m5d.16xlarge", + "vcpuNum": 64 + }, + { + "_defaultOrder": 19, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 384, + "name": "ml.m5d.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 20, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": true, + "memoryGiB": 0, + "name": "ml.geospatial.interactive", + "supportedImageNames": [ + "sagemaker-geospatial-v1-0" + ], + "vcpuNum": 0 + }, + { + "_defaultOrder": 21, + "_isFastLaunch": true, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 4, + "name": "ml.c5.large", + "vcpuNum": 2 + }, + { + "_defaultOrder": 22, + "_isFastLaunch": false, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 8, + "name": "ml.c5.xlarge", + "vcpuNum": 4 + }, + { + "_defaultOrder": 23, + "_isFastLaunch": false, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 16, + "name": "ml.c5.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 24, + "_isFastLaunch": false, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + 
"memoryGiB": 32, + "name": "ml.c5.4xlarge", + "vcpuNum": 16 + }, + { + "_defaultOrder": 25, + "_isFastLaunch": false, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 72, + "name": "ml.c5.9xlarge", + "vcpuNum": 36 + }, + { + "_defaultOrder": 26, + "_isFastLaunch": false, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 96, + "name": "ml.c5.12xlarge", + "vcpuNum": 48 + }, + { + "_defaultOrder": 27, + "_isFastLaunch": false, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 144, + "name": "ml.c5.18xlarge", + "vcpuNum": 72 + }, + { + "_defaultOrder": 28, + "_isFastLaunch": false, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 192, + "name": "ml.c5.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 29, + "_isFastLaunch": true, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 16, + "name": "ml.g4dn.xlarge", + "vcpuNum": 4 + }, + { + "_defaultOrder": 30, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.g4dn.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 31, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 64, + "name": "ml.g4dn.4xlarge", + "vcpuNum": 16 + }, + { + "_defaultOrder": 32, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 128, + "name": "ml.g4dn.8xlarge", + "vcpuNum": 32 + }, + { + "_defaultOrder": 33, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 4, + "hideHardwareSpecs": false, + "memoryGiB": 192, + "name": "ml.g4dn.12xlarge", + "vcpuNum": 48 + }, + { + "_defaultOrder": 34, + "_isFastLaunch": false, + "category": "Accelerated computing", + 
"gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 256, + "name": "ml.g4dn.16xlarge", + "vcpuNum": 64 + }, + { + "_defaultOrder": 35, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 61, + "name": "ml.p3.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 36, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 4, + "hideHardwareSpecs": false, + "memoryGiB": 244, + "name": "ml.p3.8xlarge", + "vcpuNum": 32 + }, + { + "_defaultOrder": 37, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 8, + "hideHardwareSpecs": false, + "memoryGiB": 488, + "name": "ml.p3.16xlarge", + "vcpuNum": 64 + }, + { + "_defaultOrder": 38, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 8, + "hideHardwareSpecs": false, + "memoryGiB": 768, + "name": "ml.p3dn.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 39, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 16, + "name": "ml.r5.large", + "vcpuNum": 2 + }, + { + "_defaultOrder": 40, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.r5.xlarge", + "vcpuNum": 4 + }, + { + "_defaultOrder": 41, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 64, + "name": "ml.r5.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 42, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 128, + "name": "ml.r5.4xlarge", + "vcpuNum": 16 + }, + { + "_defaultOrder": 43, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 256, + "name": "ml.r5.8xlarge", + "vcpuNum": 32 + }, + { + "_defaultOrder": 44, + "_isFastLaunch": false, + "category": 
"Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 384, + "name": "ml.r5.12xlarge", + "vcpuNum": 48 + }, + { + "_defaultOrder": 45, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 512, + "name": "ml.r5.16xlarge", + "vcpuNum": 64 + }, + { + "_defaultOrder": 46, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 768, + "name": "ml.r5.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 47, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 16, + "name": "ml.g5.xlarge", + "vcpuNum": 4 + }, + { + "_defaultOrder": 48, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.g5.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 49, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 64, + "name": "ml.g5.4xlarge", + "vcpuNum": 16 + }, + { + "_defaultOrder": 50, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 128, + "name": "ml.g5.8xlarge", + "vcpuNum": 32 + }, + { + "_defaultOrder": 51, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 256, + "name": "ml.g5.16xlarge", + "vcpuNum": 64 + }, + { + "_defaultOrder": 52, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 4, + "hideHardwareSpecs": false, + "memoryGiB": 192, + "name": "ml.g5.12xlarge", + "vcpuNum": 48 + }, + { + "_defaultOrder": 53, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 4, + "hideHardwareSpecs": false, + "memoryGiB": 384, + "name": "ml.g5.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 54, + 
"_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 8, + "hideHardwareSpecs": false, + "memoryGiB": 768, + "name": "ml.g5.48xlarge", + "vcpuNum": 192 + }, + { + "_defaultOrder": 55, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 8, + "hideHardwareSpecs": false, + "memoryGiB": 1152, + "name": "ml.p4d.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 56, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 8, + "hideHardwareSpecs": false, + "memoryGiB": 1152, + "name": "ml.p4de.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 57, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.trn1.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 58, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 512, + "name": "ml.trn1.32xlarge", + "vcpuNum": 128 + }, + { + "_defaultOrder": 59, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 512, + "name": "ml.trn1n.32xlarge", + "vcpuNum": 128 + } + ], + "instance_type": "ml.t3.medium", + "kernelspec": { + "display_name": "base", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/evaluation-observe/bedrock-rag-evaluation/knowledge-base-evaluation-job.ipynb b/evaluation-observe/bedrock-rag-evaluation/knowledge-base-evaluation-job.ipynb new file mode 100644 index 00000000..46329e4c --- /dev/null +++ b/evaluation-observe/bedrock-rag-evaluation/knowledge-base-evaluation-job.ipynb @@ -0,0 +1,336 @@ +{ + "cells": 
[ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Amazon Bedrock Knowledge Base Evaluation Guide\n", + "\n", + "## Introduction\n", + "\n", + "Amazon Bedrock Knowledge Base Evaluation provides a comprehensive solution for assessing RAG (Retrieval-Augmented Generation) applications. This guide demonstrates how to evaluate both retrieval and generation components of your RAG system using Amazon Bedrock APIs.\n", + "\n", + "Through this guide, we'll explore:\n", + "- Setting up evaluation configurations\n", + "- Creating retrieval-only evaluation jobs\n", + "- Creating retrieval-with-generation evaluation jobs\n", + "- Monitoring evaluation progress\n", + "\n", + "## Prerequisites\n", + "\n", + "Before we begin, make sure you have:\n", + "- An active AWS account with appropriate permissions\n", + "- Amazon Bedrock access enabled in your preferred region\n", + "- An S3 bucket with CORS enabled for storing evaluation data\n", + "- A created and synced Amazon Bedrock Knowledge Base\n", + "- An IAM role with necessary permissions for S3 and Bedrock\n", + "\n", + "To complete these prerequisites, follow the how-to steps available [here](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-evaluation-prereq.html).\n", + "\n", + "> **Important**: Make sure that your knowledge base is synced and ready before starting any evaluation job.\n", + "\n", + "## Dataset Format\n", + "\n", + "The evaluation data must follow specific JSONL formats based on the type of evaluation:\n", + "\n", + "### Retrieval-only Evaluation Format\n", + "```json\n", + "{\n", + " \"conversationTurns\": [{\n", + " \"referenceContexts\": [{\n", + " \"content\": [{\n", + " \"text\": \"Reference context for evaluation\"\n", + " }]\n", + " }],\n", + " \"prompt\": {\n", + " \"content\": [{\n", + " \"text\": \"Your prompt here\"\n", + " }]\n", + " }\n", + " }]\n", + "}\n", + "```\n", + "\n", + "### Retrieval and Generation Evaluation Format\n", + "```json\n", + "{\n", + " 
\"conversationTurns\": [{\n", + " \"referenceResponses\": [{\n", + " \"content\": [{\n", + " \"text\": \"Reference response for evaluation\"\n", + " }]\n", + " }],\n", + " \"prompt\": {\n", + " \"content\": [{\n", + " \"text\": \"Your prompt here\"\n", + " }]\n", + " }\n", + " }]\n", + "}\n", + "```\n", + "\n", + "## Dataset Requirements\n", + "\n", + "### Job Requirements\n", + "- Maximum 1000 prompts per evaluation job\n", + "- Each line in the JSONL file must be a complete prompt\n", + "\n", + "### File Requirements\n", + "- File must use JSONL format with `.jsonl` extension\n", + "- Each line must be a valid JSON object\n", + "- File must be stored in an S3 bucket with CORS enabled\n", + "\n", + "### Data Structure Requirements\n", + "For Retrieval-only Evaluation:\n", + "- Must include `referenceContexts` as shown in the format above\n", + "- Each prompt must follow the specified JSON structure\n", + "\n", + "For Retrieval and Generation Evaluation:\n", + "- Optional `referenceResponses` as shown in the format above\n", + "- Must follow the specified JSON structure\n", + "\n", + "> **Note**: When preparing your dataset, consider your evaluation objectives and make sure that your prompts and reference data align with your assessment goals. 
\n", + "\n", + "## Implementation\n", + "\n", + "First, let's set up our configuration parameters:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import boto3\n", + "import time\n", + "from datetime import datetime\n", + "\n", + "# Generate unique name for the job\n", + "job_name = f\"kb-evaluation-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}\"\n", + "\n", + "# Configure knowledge base and model settings\n", + "knowledge_base_id = \"\"\n", + "evaluator_model = \"mistral.mistral-large-2402-v1:0\"\n", + "generator_model = \"anthropic.claude-3-sonnet-20240229-v1:0\"\n", + "role_arn = \"arn:aws:iam:::role/\"\n", + "\n", + "# Specify S3 locations\n", + "input_data = \"s3:///evaluation_data/input.jsonl\"\n", + "output_path = \"s3:///evaluation_output/\"\n", + "\n", + "# Configure retrieval settings\n", + "num_results = 5\n", + "search_type = \"HYBRID\"\n", + "\n", + "# Create Bedrock client\n", + "bedrock_client = boto3.client('bedrock', region_name='us-east-1')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Creating a Retrieval-only Evaluation Job\n", + "\n", + "This configuration focuses on assessing the quality of retrieved contexts. 
Available metrics for retrieval evaluation:\n", + "- `Builtin.ContextRelevance`: Assesses how relevant the retrieved contexts are to the query\n", + "- `Builtin.ContextCoverage`: Measures how well the retrieved contexts cover the information needed" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "retrieval_job = bedrock_client.create_evaluation_job(\n", + " jobName=job_name,\n", + " jobDescription=\"Evaluate retrieval performance\",\n", + " roleArn=role_arn,\n", + " applicationType=\"RagEvaluation\",\n", + " inferenceConfig={\n", + " \"ragConfigs\": [{\n", + " \"knowledgeBaseConfig\": {\n", + " \"retrieveConfig\": {\n", + " \"knowledgeBaseId\": knowledge_base_id,\n", + " \"knowledgeBaseRetrievalConfiguration\": {\n", + " \"vectorSearchConfiguration\": {\n", + " \"numberOfResults\": num_results,\n", + " \"overrideSearchType\": search_type\n", + " }\n", + " }\n", + " }\n", + " }\n", + " }]\n", + " },\n", + " outputDataConfig={\n", + " \"s3Uri\": output_path\n", + " },\n", + " evaluationConfig={\n", + " \"automated\": {\n", + " \"datasetMetricConfigs\": [{\n", + " \"taskType\": \"Custom\",\n", + " \"dataset\": {\n", + " \"name\": \"RagDataset\",\n", + " \"datasetLocation\": {\n", + " \"s3Uri\": input_data\n", + " }\n", + " },\n", + " \"metricNames\": [\n", + " \"Builtin.ContextRelevance\",\n", + " \"Builtin.ContextCoverage\"\n", + " ]\n", + " }],\n", + " \"evaluatorModelConfig\": {\n", + " \"bedrockEvaluatorModels\": [{\n", + " \"modelIdentifier\": evaluator_model\n", + " }]\n", + " }\n", + " }\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Creating a Retrieval and Generation Evaluation Job\n", + "\n", + "This configuration evaluates both retrieval and response generation. 
Available metrics for this evaluation:\n", + "- `Builtin.Correctness`: Evaluates factual accuracy of generated responses\n", + "- `Builtin.Completeness`: Assesses if all relevant information is included\n", + "- `Builtin.Helpfulness`: Measures how useful the response is\n", + "- `Builtin.LogicalCoherence`: Evaluates response structure and flow\n", + "- `Builtin.Faithfulness`: Checks for hallucinations or made-up information\n", + "- `Builtin.Harmfulness`: Detects harmful content\n", + "- `Builtin.Stereotyping`: Identifies biased or stereotypical responses\n", + "- `Builtin.Refusal`: Evaluates appropriate refusal of problematic requests" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "time.sleep(1)\n", + "job_name_rg = f\"kb-evaluation-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}\"\n", + "retrieve_generate_job = bedrock_client.create_evaluation_job(\n", + " jobName=job_name_rg,\n", + " jobDescription=\"Evaluate retrieval and generation\",\n", + " roleArn=role_arn,\n", + " applicationType=\"RagEvaluation\",\n", + " inferenceConfig={\n", + " \"ragConfigs\": [{\n", + " \"knowledgeBaseConfig\": {\n", + " \"retrieveAndGenerateConfig\": {\n", + " \"type\": \"KNOWLEDGE_BASE\",\n", + " \"knowledgeBaseConfiguration\": {\n", + " \"knowledgeBaseId\": knowledge_base_id,\n", + " \"modelArn\": generator_model,\n", + " \"retrievalConfiguration\": {\n", + " \"vectorSearchConfiguration\": {\n", + " \"numberOfResults\": num_results,\n", + " \"overrideSearchType\": search_type\n", + " }\n", + " }\n", + " }\n", + " }\n", + " }\n", + " }]\n", + " },\n", + " outputDataConfig={\n", + " \"s3Uri\": output_path\n", + " },\n", + " evaluationConfig={\n", + " \"automated\": {\n", + " \"datasetMetricConfigs\": [{\n", + " \"taskType\": \"Custom\",\n", + " \"dataset\": {\n", + " \"name\": \"RagDataset\",\n", + " \"datasetLocation\": {\n", + " \"s3Uri\": input_data\n", + " }\n", + " },\n", + " \"metricNames\": [\n", + " 
\"Builtin.Correctness\",\n", + " \"Builtin.Completeness\",\n", + " \"Builtin.Helpfulness\",\n", + " \"Builtin.LogicalCoherence\",\n", + " \"Builtin.Faithfulness\"\n", + " ]\n", + " }],\n", + " \"evaluatorModelConfig\": {\n", + " \"bedrockEvaluatorModels\": [{\n", + " \"modelIdentifier\": evaluator_model\n", + " }]\n", + " }\n", + " }\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Monitoring Job Progress\n", + "\n", + "Track the status of your evaluation job:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Get job ARN based on job type\n", + "evaluation_job_arn = retrieval_job['jobArn'] # or retrieve_generate_job['jobArn']\n", + "\n", + "# Check job status\n", + "response = bedrock_client.get_evaluation_job(\n", + " jobIdentifier=evaluation_job_arn \n", + ")\n", + "print(f\"Job Status: {response['status']}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "In this guide, we've walked through the process of implementing Knowledge Base Evaluation using Amazon Bedrock. The feature enables organizations to:\n", + "- Assess AI model outputs across various tasks and contexts\n", + "- Evaluate multiple dimensions of AI performance simultaneously\n", + "- Systematically assess both retrieval and generation quality in RAG systems\n", + "- Scale evaluations across thousands of responses while maintaining quality standards\n", + "\n", + "Remember to follow the best practices outlined above to ensure effective evaluation of your RAG applications." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "base", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/evaluation-observe/bedrock-rag-evaluation/synthetic_data/AMAZON_2022_10K.pdf b/evaluation-observe/bedrock-rag-evaluation/synthetic_data/AMAZON_2022_10K.pdf new file mode 100644 index 00000000..a356f262 Binary files /dev/null and b/evaluation-observe/bedrock-rag-evaluation/synthetic_data/AMAZON_2022_10K.pdf differ