ExperimentPlanTemplateTemplatePipelineEnvParams

data class ExperimentPlanTemplateTemplatePipelineEnvParams(val cpuPerWorker: Int, val cudaVersion: String? = null, val gpuDriverVersion: String? = null, val gpuPerWorker: Int, val memoryPerWorker: Int, val ncclVersion: String? = null, val pyTorchVersion: String? = null, val shareMemory: Int, val workerNum: Int)

Constructors

constructor(cpuPerWorker: Int, cudaVersion: String? = null, gpuDriverVersion: String? = null, gpuPerWorker: Int, memoryPerWorker: Int, ncclVersion: String? = null, pyTorchVersion: String? = null, shareMemory: Int, workerNum: Int)
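
A minimal instantiation sketch follows; every value shown is a hypothetical placeholder chosen for illustration, not a recommended setting:

val envParams = ExperimentPlanTemplateTemplatePipelineEnvParams(
    cpuPerWorker = 16,               // CPUs allocated per worker (hypothetical)
    cudaVersion = "12.1",            // optional; hypothetical CUDA version
    gpuDriverVersion = "535.54.03",  // optional; hypothetical driver version
    gpuPerWorker = 8,                // GPUs allocated per worker (hypothetical)
    memoryPerWorker = 128,           // memory per worker (hypothetical)
    ncclVersion = "2.18.3",          // optional; hypothetical NCCL version
    pyTorchVersion = "2.1.0",        // optional; hypothetical PyTorch version
    shareMemory = 32,                // shared memory in GB (hypothetical)
    workerNum = 4                    // total worker nodes (hypothetical)
)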

Types

object Companion

Properties

val cpuPerWorker: Int

Number of central processing units (CPUs) allocated per worker. This parameter affects the processing power of the computation, especially for tasks that require a large amount of parallel processing.

val cudaVersion: String? = null

The version of CUDA (Compute Unified Device Architecture) used. CUDA is a parallel computing platform and programming model provided by NVIDIA. The specific version may affect the available GPU functions and performance optimizations.

val gpuDriverVersion: String? = null

The version of the GPU driver used. The driver version may affect GPU performance and compatibility, so it is important to ensure that the correct version is used.

val gpuPerWorker: Int

Number of graphics processing units (GPUs) allocated per worker. GPUs are a key component in deep learning and large-scale data processing, so this parameter is very important for tasks that require graphics-accelerated computing.

val memoryPerWorker: Int

The amount of memory available per worker. Memory size has an important impact on the performance and stability of the program, especially when processing large data sets or high-dimensional data.

val ncclVersion: String? = null

The NVIDIA Collective Communications Library (NCCL) version used. NCCL is a library for multi-GPU and multi-node communication. This parameter is particularly important for optimizing data transfer in distributed computing.

val pyTorchVersion: String? = null

The version of the PyTorch framework used. PyTorch is a widely used deep learning library, and differences between versions may affect performance and feature support for model training and inference.

val shareMemory: Int

Shared memory allocation, in GB.

val workerNum: Int

The total number of worker nodes. This parameter directly affects the parallelism and computing speed of the task; a higher number of worker nodes usually accelerates task completion.