Sayli_D
(Sayli Dighde)
November 22, 2023, 5:17pm
> **Using TorchServe on SageMaker Inf2.24xlarge with Llama2-13B**
>
> ## Contents
>
> This notebook uses the SageMaker notebook instance conda_python3 kernel and demonstrates how to use TorchServe to deploy Llama-2-13B on SageMaker inf2.24xlarge. There are multiple advanced features in this example.
>
> * Neuronx AOT precompile model
>
> (file truncated)
I could create an endpoint as above for the Llama-2-13B base model, but for the 13B chat model endpoint creation fails with a timeout error on the primary container.
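One thing worth checking for a startup timeout with a large Neuron model is the endpoint's container startup health-check window, since loading/compiling a 13B model can exceed the default. Below is a minimal sketch of a production-variant config that raises `ContainerStartupHealthCheckTimeoutInSeconds` (the model and config names here are hypothetical placeholders, not from the notebook):

```python
def make_variant(model_name: str, timeout_s: int = 1800) -> dict:
    """Production variant with an extended container startup health-check window."""
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,                    # hypothetical model name
        "InstanceType": "ml.inf2.24xlarge",
        "InitialInstanceCount": 1,
        # Large Neuron models can take longer than the default window to load.
        "ContainerStartupHealthCheckTimeoutInSeconds": timeout_s,
    }

# Pass this to SageMaker when creating the endpoint config, e.g.:
# import boto3
# boto3.client("sagemaker").create_endpoint_config(
#     EndpointConfigName="llama2-13b-chat-config",  # hypothetical name
#     ProductionVariants=[make_variant("llama2-13b-chat")],
# )
```

If the base model deploys but the chat model times out, a longer window at least rules out slow startup as the cause.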
For the above, I created the Neuron artifacts for the 13B chat model using the following:
```bash
cd serve
# Install dependencies
python ts_scripts/install_dependencies.py --neuronx --environment=dev
# Install torchserve and torch-model-archiver
python ts_scripts/install_from_src.py
# Install additional neuron packages, SDK 2.12.2: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/content.html#id8
python -m pip install neuronx-cc==2.8.0.25 torch-neuronx==1.13.1.1.9.1 transformers-neuronx==0.5.58
# Navigate to `examples/large_models/inferentia2/llama2` directory
cd examples/large_models/inferentia2/llama2/
# Install additional necessary packages
python -m pip install -r requirements.txt
```
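Neuron problems like this are often version mismatches, so it can help to confirm the pinned packages from the commands above are what is actually installed. A small sketch using the standard library (`importlib.metadata`); the pins are copied from the install commands:

```python
from importlib.metadata import version, PackageNotFoundError

# Versions pinned by the install commands above (Neuron SDK 2.12.2)
pinned = {
    "neuronx-cc": "2.8.0.25",
    "torch-neuronx": "1.13.1.1.9.1",
    "transformers-neuronx": "0.5.58",
}

for pkg, want in pinned.items():
    try:
        got = version(pkg)
        print(pkg, got, "OK" if got == want else f"expected {want}")
    except PackageNotFoundError:
        print(pkg, "not installed")
```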
### Step 3: Save the model artifacts compatible with `transformers-neuronx`

In order to use the pre-compiled model artifacts, copy them from the model zoo using the command shown below and skip to **Step 5**. (The copied command was truncated.)
I could start TorchServe and run inference via a curl command here, so the model artifacts look okay. But the same artifacts won't work in the first notebook referenced above.