lep deployment

Manage deployments on the Lepton AI cloud.

A deployment is a running instance of a photon. Deployments are typically created with the lep photon run command, or with the lep deployment create command documented below. A deployment usually exposes one or more HTTP endpoints that users call, either via a RESTful API or via the Python client defined in leptonai.client.

The deployment commands let you list, manage, and remove deployments on the Lepton AI cloud (see the example flow after the command list below).

Usage

lep deployment [OPTIONS] COMMAND [ARGS]...

Options

  • --help : Show this message and exit.

Commands

  • create : Creates a deployment from either a photon or container image.
  • events : Lists events of a deployment.
  • list : Lists all deployments in the current workspace.
  • log : Gets the log of a deployment.
  • remove : Removes a deployment.
  • status : Gets the status of a deployment.
  • update : Updates a deployment.
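
Example

A minimal end-to-end flow using the commands documented on this page; the photon and deployment names are hypothetical placeholders:

lep deployment create -n my-deployment -p my-photon
lep deployment status -n my-deployment
lep deployment log -n my-deployment
lep deployment remove -n my-deployment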

lep deployment create

Creates a deployment from either a photon or container image.

Usage

lep deployment create [OPTIONS]

Options

  • -n, --name TEXT : Name of the deployment being created.
  • -p, --photon TEXT : Name of the photon to run.
  • -i, --photon-id TEXT : Specific version id of the photon to run. If not specified, we will run the most recent version of the photon.
  • --container-image TEXT : Container image to run.
  • --container-port INTEGER : Guest OS port to listen on in the container. If not specified, defaults to 8080.
  • --container-command TEXT : Command to run in the container. Your command should listen to the port specified by --container-port.
  • --resource-shape TEXT : Resource shape for the deployment. Available types are: 'cpu.small', 'cpu.medium', 'cpu.large', 'gpu.a10', 'gpu.a10.6xlarge', 'gpu.a100-40gb', 'gpu.2xa100-40gb', 'gpu.4xa100-40gb', 'gpu.8xa100-40gb', 'gpu.a100-80gb', 'gpu.2xa100-80gb', 'gpu.4xa100-80gb', 'gpu.8xa100-80gb', 'gpu.h100-sxm', 'gpu.2xh100-sxm', 'gpu.4xh100-sxm', 'gpu.8xh100-sxm'.
  • --min-replicas INTEGER : (Will be deprecated soon) Minimum number of replicas.
  • --max-replicas INTEGER : (Will be deprecated soon) Maximum number of replicas.
  • --mount TEXT : Persistent storage to be mounted to the deployment, in the format STORAGE_PATH:MOUNT_PATH.
  • -e, --env TEXT : Environment variables to pass to the deployment, in the format NAME=VALUE.
  • -s, --secret TEXT : Secrets to pass to the deployment, in the format NAME=SECRET_NAME. If secret name is also the environment variable name, you can omit it and simply pass SECRET_NAME.
  • --public : If specified, the photon will be publicly accessible. See docs for details on access control.
  • --tokens TEXT : Additional tokens that can be used to access the photon. See docs for details on access control.
  • --no-traffic-timeout INTEGER : (Will be deprecated soon) If specified, the deployment will be scaled down to 0 replicas after the specified number of seconds without traffic. Minimum is 60 seconds if set. Note that the actual timeout may be up to 30 seconds longer than the specified value.
  • --target-gpu-utilization INTEGER : (Will be deprecated soon) If min and max replicas are set, the autoscaler scales up the replicas when GPU utilization is higher than this target, and scales down when it is lower. The value should be between 0 and 99.
  • --initial-delay-seconds INTEGER : If specified, the deployment will allow the specified number of seconds for the photon to initialize before it starts the service. Usually you should not need this; if your deployment takes a long time to initialize, set it to a larger value.
  • --include-workspace-token : If specified, the workspace token will be included as an environment variable. This is needed when the photon code uses Lepton SDK capabilities such as queue, KV, and object store. Note that you should trust the code in the photon, as it will have access to the workspace token.
  • --rerun : If specified, shuts down any existing deployment with the same name and reruns it. Note that this may cause downtime if the photon is serving production traffic, so use with caution. In a production environment, create and push a new photon version and use lep deployment update instead.
  • --public-photon : If specified, get the photon from the public photon registry. This is only supported for remote execution.
  • --image-pull-secrets TEXT : Secrets to use for pulling images.
  • -ng, --node-group TEXT : Node group for the deployment. If not set, on-demand resources are used. You can repeat this flag to specify multiple node groups; however, multiple node groups are not yet supported (coming soon for enterprise users), and only the first node group is used at this time.
  • --visibility TEXT : Visibility of the deployment. Can be 'public' or 'private'. If private, the deployment will only be viewable by the creator and workspace admin.
  • -r, -replicas, --replicas-static INTEGER : Use this option if you want a fixed number of replicas and want to turn off autoscaling. For example, to set a fixed number of replicas to 2, you can use: --replicas-static 2 or -r 2
  • -ad, --autoscale-down TEXT : Use this option if you want to have replicas but scale down after a specified time of no traffic. For example, to set 2 replicas and scale down after 3600 seconds of no traffic, use: --autoscale-down 2,3600s or --autoscale-down 2,3600 (Note: Do not include spaces around the comma.)
  • -agu, --autoscale-gpu-util TEXT : Use this option to set a threshold for GPU utilization and enable the system to scale between a minimum and maximum number of replicas. For example, to scale between 1 (min_replica) and 3 (max_replica) with a 50% threshold, use: --autoscale-gpu-util 1,3,50% or --autoscale-gpu-util 1,3,50 (Note: Do not include spaces around the commas.) See the examples after this list.

If the GPU utilization is higher than the target GPU utilization, the autoscaler will scale up the replicas. If the GPU utilization is lower than the target GPU utilization, the autoscaler will scale down the replicas. The threshold value should be between 0 and 99.

  • -aq, --autoscale-qpm TEXT : Use this option to set a threshold for queries per minute (QPM) and enable the system to scale between a minimum and maximum number of replicas. For example, to scale between 1 (min_replica) and 3 (max_replica) with a threshold of 2.5 QPM, use: --autoscale-qpm 1,3,2.5 (Note: Do not include spaces around the commas.)

This sets up autoscaling based on queries per minute, scaling between 1 and 3 replicas when QPM per replica exceeds 2.5.

  • -lg, --log-collection BOOLEAN : Enable or disable log collection (true/false). If not provided, the workspace setting will be used.
  • --help : Show this message and exit.
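
Examples

The following invocations are illustrative sketches; all deployment names, photon names, images, paths, and values are hypothetical placeholders:

# Run the latest version of a photon on a small CPU shape with a fixed replica count.
lep deployment create -n my-deployment -p my-photon --resource-shape cpu.small --replicas-static 2

# Run a container image with an environment variable and mounted storage.
lep deployment create -n my-web --container-image nginx:latest --container-port 8080 -e LOG_LEVEL=info --mount /data:/mnt/data

# Autoscale between 1 and 3 GPU replicas at a 50% utilization threshold.
lep deployment create -n my-llm -p my-photon --resource-shape gpu.a10 --autoscale-gpu-util 1,3,50%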

lep deployment list

Lists all deployments in the current workspace.

Usage

lep deployment list [OPTIONS]

Options

  • -p, --pattern TEXT : Regular expression pattern to filter deployment names.
  • --help : Show this message and exit.
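
Example

For instance, to show only deployments whose names start with a hypothetical prod- prefix:

lep deployment list -p '^prod-'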

lep deployment remove

Removes a deployment.

Usage

lep deployment remove [OPTIONS]

Options

  • -n, --name TEXT : The deployment name to remove. [required]
  • --help : Show this message and exit.
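
Example

For instance, to remove a hypothetical deployment named my-deployment:

lep deployment remove -n my-deployment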

lep deployment status

Gets the status of a deployment.

Usage

lep deployment status [OPTIONS]

Options

  • -n, --name TEXT : The deployment name to get the status of. [required]
  • -t, --show-tokens : Show tokens for the deployment. Use with caution as this displays the tokens in plain text, and may be visible to others if you log the output.
  • -d, --detail : Show the deployment details.
  • --help : Show this message and exit.
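
Example

For instance, to show detailed status for a hypothetical deployment:

lep deployment status -n my-deployment -d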

lep deployment log

Gets the log of a deployment. If a replica is not specified, the first replica is selected; otherwise, the log of the specified replica is shown. To get the list of replicas, use lep deployment status.

Usage

lep deployment log [OPTIONS]

Options

  • -n, --name TEXT : The deployment name to get logs for. [required]
  • -r, --replica TEXT : The replica name to get logs for.
  • --help : Show this message and exit.
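
Example

For instance, to find replica names and then view the log of a specific (hypothetical) replica:

lep deployment status -n my-deployment
lep deployment log -n my-deployment -r my-replica-name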

lep deployment update

Updates a deployment. Note that for all the update options, changes are made as replacements, not increments. For example, if you specify --tokens, the old tokens are replaced by the new set of tokens.

Usage

lep deployment update [OPTIONS]

Options

  • -n, --name TEXT : The deployment name to update. [required]
  • -i, --id TEXT : The new photon id to update to. Use latest for the latest id.
  • --min-replicas INTEGER : Number of replicas to update to. Pass 0 to scale the number of replicas to zero, in which case the deployment status page will show the deployment as not ready until you scale it back up with a positive number of replicas.
  • --resource-shape TEXT : Resource shape for the pod. Available types are: 'cpu.small', 'cpu.medium', 'cpu.large', 'gpu.a10', 'gpu.a10.6xlarge', 'gpu.a100-40gb', 'gpu.2xa100-40gb', 'gpu.4xa100-40gb', 'gpu.8xa100-40gb', 'gpu.a100-80gb', 'gpu.2xa100-80gb', 'gpu.4xa100-80gb', 'gpu.8xa100-80gb', 'gpu.h100-sxm', 'gpu.2xh100-sxm', 'gpu.4xh100-sxm', 'gpu.8xh100-sxm'.
  • --public / --no-public : If --public is specified, the deployment will be made public. If --no-public is specified, the deployment will be made non-public, with access tokens being the workspace token and the tokens specified by --tokens. If neither is specified, no change will be made to the access control of the deployment.
  • --tokens TEXT : Access tokens that can be used to access the deployment. See docs for details on access control. If no tokens are specified, the deployment's tokens are left unchanged. If you want to remove all additional tokens, use --remove-tokens.
  • --remove-tokens : If specified, all additional tokens will be removed, and the deployment will be either public (if --public is specified) or accessible only with the workspace token (if --public is not specified).
  • --no-traffic-timeout INTEGER : If specified, the deployment will be scaled down to 0 replicas after the specified number of seconds without traffic. Set to 0 to explicitly change the deployment to have no timeout.
  • --visibility TEXT : Visibility of the deployment. Can be 'public' or 'private'. If private, the deployment will only be viewable by the creator and workspace admin.
  • -r, -replicas, --replicas-static INTEGER : Use this option if you want a fixed number of replicas and want to turn off autoscaling. For example, to set a fixed number of replicas to 2, you can use: --replicas-static 2 or -r 2
  • -ad, --autoscale-down TEXT : Use this option if you want to have replicas but scale down after a specified time of no traffic. For example, to set 2 replicas and scale down after 3600 seconds of no traffic, use: --autoscale-down 2,3600s or --autoscale-down 2,3600 (Note: Do not include spaces around the comma.)
  • -agu, --autoscale-gpu-util TEXT : Use this option to set a threshold for GPU utilization and enable the system to scale between a minimum and maximum number of replicas. For example, to scale between 1 (min_replica) and 3 (max_replica) with a 50% threshold, use: --autoscale-gpu-util 1,3,50% or --autoscale-gpu-util 1,3,50 (Note: Do not include spaces around the commas.) See the examples after this list.

If the GPU utilization is higher than the target GPU utilization, the autoscaler will scale up the replicas. If the GPU utilization is lower than the target GPU utilization, the autoscaler will scale down the replicas. The threshold value should be between 0 and 99.

  • -aq, --autoscale-qpm TEXT : Use this option to set a threshold for queries per minute (QPM) and enable the system to scale between a minimum and maximum number of replicas. For example, to scale between 1 (min_replica) and 3 (max_replica) with a threshold of 2.5 QPM, use: --autoscale-qpm 1,3,2.5 (Note: Do not include spaces around the commas.)

This sets up autoscaling based on queries per minute, scaling between 1 and 3 replicas when QPM per replica exceeds 2.5.

  • -lg, --log-collection BOOLEAN : Enable or disable log collection (true/false). If not provided, the workspace setting will be used.
  • --help : Show this message and exit.
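
Examples

The following sketches show common updates; the deployment name and token value are hypothetical placeholders:

# Roll the deployment to the latest photon version.
lep deployment update -n my-deployment -i latest

# Replace the set of access tokens (a replacement, not an increment).
lep deployment update -n my-deployment --tokens MY_NEW_TOKEN

# Autoscale between 1 and 3 replicas at a 50% GPU utilization threshold.
lep deployment update -n my-deployment --autoscale-gpu-util 1,3,50%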

lep deployment events

Lists events of a deployment.

Usage

lep deployment events [OPTIONS]

Options

  • -n, --name TEXT : The deployment name to get events for. [required]
  • --help : Show this message and exit.
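
Example

For instance, to list events for a hypothetical deployment:

lep deployment events -n my-deployment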