Then use it in the following code:
    import os

    import openai

    # Point the OpenAI client at the Lepton endpoint; the token is read from the environment.
    client = openai.OpenAI(
        base_url="https://llama2-13b.lepton.run/api/v1/",
        api_key=os.environ.get('LEPTON_API_TOKEN')
    )

    completion = client.chat.completions.create(
        model="llama2-13b",
        messages=[
            {"role": "user", "content": "say hello"},
        ],
        max_tokens=128,
        stream=True,
    )

    # Print the streamed response as chunks arrive.
    for chunk in completion:
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="")
The rate limit for Serverless Endpoints is 10 requests per minute across all models under the Basic Plan. If you need a higher rate limit with an SLA, please upgrade to the Standard Plan or use a dedicated deployment.
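To stay under the 10 requests per minute limit, one option is to throttle requests on the client side. The following is a minimal sketch, not part of the Lepton or OpenAI libraries: `SlidingWindowLimiter` is a hypothetical helper, and it assumes the limit is counted over a sliding 60-second window, which the text above does not specify.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Hypothetical client-side throttle: at most `max_calls` per `period` seconds."""

    def __init__(self, max_calls=10, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock    # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._calls = deque()  # timestamps of calls inside the current window

    def acquire(self):
        """Block until another call is allowed, then record it."""
        while True:
            now = self._clock()
            # Drop timestamps that have left the window.
            while self._calls and now - self._calls[0] >= self.period:
                self._calls.popleft()
            if len(self._calls) < self.max_calls:
                self._calls.append(now)
                return
            # Window is full: wait until the oldest call expires.
            self._sleep(self.period - (now - self._calls[0]))
```

Create one limiter (for example `limiter = SlidingWindowLimiter()`) and call `limiter.acquire()` immediately before each `client.chat.completions.create(...)` call. Because the limit applies across all models, a single limiter instance should be shared by every request made with the same token.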