Today, we're announcing the general availability of Amazon SageMaker HyperPod recipes to help data scientists and developers of all skill sets get started training and fine-tuning foundation models (FMs) in minutes with state-of-the-art performance. They can now access optimized recipes for training and fine-tuning popular publicly available FMs such as Llama 3.1 405B, Llama 3.2 90B, or Mixtral 8x22B.
At AWS re:Invent 2023, we introduced SageMaker HyperPod to reduce time to train FMs by up to 40 percent and scale across more than a thousand compute resources in parallel with preconfigured distributed training libraries. With SageMaker HyperPod, you can find the required accelerated compute resources for training, create the most optimal training plans, and run training workloads across different blocks of capacity based on the availability of compute resources.
SageMaker HyperPod recipes include a training stack tested by AWS, removing the tedious work of experimenting with different model configurations and eliminating weeks of iterative evaluation and testing. The recipes automate several critical steps, such as loading training datasets, applying distributed training techniques, automating checkpoints for faster recovery from faults, and managing the end-to-end training loop.
With a simple recipe change, you can seamlessly switch between GPU- or Trainium-based instances to further optimize training performance and reduce costs. You can easily run workloads in production on SageMaker HyperPod or SageMaker training jobs.
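For instance, a GPU-to-Trainium switch can come down to a couple of lines in the launcher configuration, as the sketch below shows. This is a minimal illustration only, and the Trainium recipe name shown is hypothetical; browse the repository for the recipes actually published:

defaults:
- cluster: slurm
- recipes: fine-tuning/llama/hf_llama3_70b_seq8k_trn1_qlora # hypothetical Trainium recipe name
instance_type: ml.trn1.32xlarge # Trainium-based instances instead of ml.p5.48xlarge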
SageMaker HyperPod recipes in action
To get started, visit the SageMaker HyperPod recipes GitHub repository to browse training recipes for popular publicly available FMs.
You only need to edit straightforward recipe parameters to specify an instance type and the location of your dataset in the cluster configuration, then run the recipe with a single-line command to achieve state-of-the-art performance.
After cloning the repository, you need to edit the recipe config.yaml file to specify the model and cluster type.
$ git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
$ cd sagemaker-hyperpod-recipes
$ pip3 install -r requirements.txt
$ cd ./recipes_collections
$ vim config.yaml
The recipes support SageMaker HyperPod with Slurm, SageMaker HyperPod with Amazon Elastic Kubernetes Service (Amazon EKS), and SageMaker training jobs. For example, you can set up a cluster type (Slurm orchestrator), a model name (Meta Llama 3.1 405B language model), an instance type (ml.p5.48xlarge), and your data locations, such as where to store the training data, results, logs, and so on.
defaults:
- cluster: slurm # support: slurm / k8s / sm_jobs
- recipes: fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora # name of model to be trained
debug: False # set to True to debug the launcher configuration
instance_type: ml.p5.48xlarge # or other supported cluster instances
base_results_dir: # Location(s) to store the results, checkpoints, logs etc.
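Note that the recipes value is a relative path to a recipe file shipped in the repository's recipes_collection directory, so pointing the launcher at a different published model is a one-line change.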
You can optionally adjust model-specific training parameters in this YAML file, which outlines the optimal configuration, including the number of accelerator devices, instance type, training precision, parallelization and sharding techniques, the optimizer, and logging to monitor experiments through TensorBoard.
run:
  name: llama-405b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "6-00:00:00"
  restore_from_path: null
trainer:
  devices: 8
  num_nodes: 2
  accelerator: gpu
  precision: bf16
  max_steps: 50
  log_every_n_steps: 10
  ...
exp_manager:
  exp_dir: # location for TensorBoard logging
  name: helloworld
  create_tensorboard_logger: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    ...
  auto_checkpoint: True # for automated checkpointing
use_smp: True
distributed_backend: smddp # optimized collectives
# Start training from pretrained model
model:
  model_type: llama_v3
  train_batch_size: 4
  tensor_model_parallel_degree: 1
  expert_model_parallel_degree: 1
  # other model-specific params
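In this recipe, tensor_model_parallel_degree controls how many devices each layer's weights are sharded across, and expert_model_parallel_degree only applies to mixture-of-experts models such as Mixtral; both are left at 1 here, presumably because QLoRA quantization shrinks the memory footprint enough for sharded data parallelism across the 16 devices to carry the model.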
To run this recipe in SageMaker HyperPod with Slurm, you must prepare the SageMaker HyperPod cluster following the cluster setup instructions.
Then, connect to the SageMaker HyperPod head node, access the Slurm controller, and copy the edited recipe. Next, you run a helper file to generate a Slurm submission script for the job, which you can use for a dry run to examine the content before starting the training job.
$ python3 main.py --config-path recipes_collection --config-name=config
After training completion, the trained model is automatically saved to your assigned data location.
To run this recipe on SageMaker HyperPod with Amazon EKS, clone the recipe from the GitHub repository, install the requirements, and edit the recipe (cluster: k8s) on your laptop. Then, create a link between your laptop and the running EKS cluster and subsequently use the HyperPod Command Line Interface (CLI) to run the recipe.
$ hyperpod start-job --recipe fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
  "recipes.run.name": "hf-llama3-405b-seq8k-gpu-qlora",
  "recipes.exp_manager.exp_dir": "/data/",
  "cluster": "k8s",
  "cluster_type": "k8s",
  "container": "658645717510.dkr.ecr..amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
  "recipes.model.data.train_dir": "",
  "recipes.model.data.val_dir": ""
}'
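Because the job is submitted to the EKS cluster as pods, you can follow its progress with standard Kubernetes tooling; the pod name below is a placeholder, and the namespace depends on how your cluster is set up.

$ kubectl get pods
$ kubectl logs -f <pod-name>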
You can also run a recipe on SageMaker training jobs using the SageMaker Python SDK. The following example runs PyTorch training scripts on SageMaker training jobs, overriding the training recipe.
...
recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
}

pytorch_estimator = PyTorch(
    output_path=,
    base_job_name=f"llama-recipe",
    role=,
    instance_type="ml.p5.48xlarge",
    training_recipe="fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora",
    recipe_overrides=recipe_overrides,
    sagemaker_session=sagemaker_session,
    tensorboard_output_config=tensorboard_output_config,
)
...
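To launch the job, you then call fit() on the estimator with your dataset locations as input channels; SageMaker mounts the train and val channels at the /opt/ml/input/data/train and /opt/ml/input/data/val paths referenced in the overrides above. The S3 URIs here are placeholders:

pytorch_estimator.fit(
    inputs={
        "train": "s3://my-bucket/datasets/train",  # placeholder S3 path
        "val": "s3://my-bucket/datasets/val",  # placeholder S3 path
    }
)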
As training progresses, the model checkpoints are saved on Amazon Simple Storage Service (Amazon S3) with the fully automated checkpointing capability, enabling faster recovery from training faults and instance restarts.
Now available
Amazon SageMaker HyperPod recipes are now available in the SageMaker HyperPod recipes GitHub repository. To learn more, visit the SageMaker HyperPod product page and the Amazon SageMaker AI Developer Guide.
Give SageMaker HyperPod recipes a try and send feedback to AWS re:Post for SageMaker or through your usual AWS Support contacts.
— Channy