
Quick Deployment of LLaMA-Factory

Overview

LLaMA-Factory is an open-source, full-stack framework for fine-tuning large models, covering the stages from pre-training through instruction fine-tuning to RLHF. It is efficient, easy to use, and scalable, and ships with LLaMA Board, a zero-code, one-stop web interface for visual fine-tuning. It supports multiple training methods, including pre-training, supervised fine-tuning, and RLHF, allows the ChatGPT training pipeline to be reproduced from scratch, and provides rich Chinese and English parameter prompts, real-time status monitoring, and concise model checkpoint management. The web page also supports reconnection and refresh.

Quick Deployment

Log in to the UCloud Global console (https://console.ucloud-global.com/uhost/uhost/create), set the machine type to “GPU Type” with “High Cost-performance Graphics Card 6,” and configure details such as the number of CPU cores and GPUs as needed.
Minimum recommended configuration: 16-core CPU, 64 GB RAM, 1 GPU.
For the image, select “Image Market,” search for “LLaMA-Factory” by image name, and use this image to create the GPU cloud host.
After the GPU cloud host has been created successfully, log in to it.

Operation Practice

Visit http://ip:7860 in your browser, replacing ip with the external IP address of the cloud host. If the page cannot be reached, check the security rules in the console.
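If the page does not load, a quick connectivity check can help distinguish a blocked security rule from a service that is not running. This is a sketch only, with a placeholder IP:

```python
# Check whether port 7860 on the cloud host is reachable from your machine.
import socket

host, port = "your.external.ip", 7860  # replace with the cloud host's external IP

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(5)
    result = s.connect_ex((host, port))

print("reachable" if result == 0 else f"not reachable (errno {result})")
```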

1. Loading the Model

Note: The model selected at position 1 and the path at position 2 must correspond to each other. The path at position 2 is the local path of the model already downloaded on the cloud host.

2. Operation


3. Training Parameters


After training in the web UI, click Refresh adapters to refresh the local LoRA weights, then select the trained LoRA weights from the Adapter path.

If a weight is selected before training, training continues from the trained LoRA weight.

If a weight is selected before chatting, the dialogue uses the model with the trained LoRA weight applied.

Note: Because the web UI follows a fixed path convention, it can only recognize LoRA weights trained through the web UI; weights trained from the command line cannot be recognized.
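For reference, here is a minimal sketch of how to list the adapters the web UI can pick up, assuming the saves/<model>/lora/<run> layout mentioned later in this guide and a PEFT-style adapter_config.json in each run directory (both are assumptions and may differ between versions):

```python
# List LoRA adapter run directories under the web UI's save path.
from pathlib import Path

SAVES_DIR = Path("/home/ubuntu/LLaMA-Factory/saves")  # path used in this guide

for adapter_cfg in sorted(SAVES_DIR.glob("*/lora/*/adapter_config.json")):
    run_dir = adapter_cfg.parent
    print(f"model: {run_dir.parent.parent.name}  adapter: {run_dir.name}")
```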


Quantization bit needs to be set to 4; otherwise the 24 GB of GPU memory will be tight for the currently recommended model.

The Prompt template is bound to the corresponding model; it is filled in automatically after the model name is chosen above and does not need to be adjusted.


Choose the training dataset at the position shown in the picture, and click Preview dataset to preview the training data.

If you need to add a custom dataset, please refer to /home/ubuntu/LLaMA-Factory/data/README.md
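As an illustration only (that README is authoritative, and field names may differ between versions), a small alpaca-style dataset could be created and registered roughly like this; the file name my_dataset.json and the column mapping are assumptions:

```python
# Illustrative sketch: register a small alpaca-style dataset with LLaMA-Factory.
# Follow /home/ubuntu/LLaMA-Factory/data/README.md for the authoritative format.
import json
from pathlib import Path

DATA_DIR = Path("/home/ubuntu/LLaMA-Factory/data")

# 1. Write the training samples (instruction / input / output triples).
samples = [
    {"instruction": "Translate to English.", "input": "你好", "output": "Hello"},
    {"instruction": "What is 1 + 1?", "input": "", "output": "2"},
]
(DATA_DIR / "my_dataset.json").write_text(
    json.dumps(samples, ensure_ascii=False, indent=2), encoding="utf-8"
)

# 2. Register the dataset in dataset_info.json so it appears in the web UI.
info_path = DATA_DIR / "dataset_info.json"
info = json.loads(info_path.read_text(encoding="utf-8"))
info["my_dataset"] = {
    "file_name": "my_dataset.json",
    "columns": {"prompt": "instruction", "query": "input", "response": "output"},
}
info_path.write_text(json.dumps(info, ensure_ascii=False, indent=2), encoding="utf-8")
```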


Keep the learning rate between 1e-5 and 5e-4 (without changing the optimizer). Its effect is somewhat unpredictable, so it is worth trying several learning rates when tuning for a better score.

The number of epochs can be adjusted according to the loss curve. For datasets with fewer than 10,000 entries, start with epochs=3 and increase it if the tail of the loss curve has not yet flattened. For datasets with more than 60,000 entries (averaging more than 20 words each), use between 1 and 3 epochs.

The recommended value for Maximum gradient norm is between 1 and 10.

Max samples limits training to the first N entries of the dataset. Normally train with the full data; when just testing the setup, it can be set to a small value such as 10 to avoid spending too long on verification.


Cutoff length is the truncation length of training sentences: the longer the sentences, the more GPU memory is used. If GPU memory is insufficient, consider reducing it to 512 or even 256; otherwise set it according to the length required by the fine-tuning target. After fine-tuning, the model's ability to handle sentences longer than the cutoff length will degrade.

Batch size affects training quality, training speed, and GPU memory usage. For training quality, the product Batch size * Gradient accumulation is generally kept between 16 and 64; this product is the final effective batch size. For training speed, a larger batch size is faster within the range observable on a 4090. For GPU memory, batch size = 4 is about the limit; reduce it if it does not fit.

For very long sentences, such as a cutoff length of 2048, you can set batch size = 1 and gradient accumulation = 16. Gradient accumulation does not affect GPU memory, but it changes the effective batch size and therefore the training quality.
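As a quick sanity check of the effective batch size for the settings discussed above (a worked example only; the numbers are illustrative):

```python
# Effective batch size = per-device batch size * gradient accumulation steps.
def effective_batch_size(batch_size: int, grad_accum: int) -> int:
    return batch_size * grad_accum

print(effective_batch_size(4, 8))   # 32 -> inside the recommended 16-64 range
print(effective_batch_size(1, 16))  # 16 -> the long-sentence setting above
```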


Save steps is the number of steps between model checkpoints, which makes it convenient to resume training after an unexpected interruption. Total steps = epochs * dataset size / batch size / gradient accumulation; this is also shown in the progress bar during training. Since each checkpoint stores the full model weights, a single checkpoint takes tens of GB.

It is recommended to increase Save steps to over 1000; otherwise the disk will easily fill up. After each training run, old checkpoints can be deleted from /home/ubuntu/LLaMA-Factory/saves/ChineseLLaMA2-13B-Chat/lora/…
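A rough, illustrative calculation (all numbers made up for the example) shows why a larger Save steps value keeps disk usage down:

```python
# Total steps = epochs * dataset size / batch size / gradient accumulation;
# one full-weight checkpoint (tens of GB) is written every `save_steps` steps.
epochs, dataset_size = 1, 60_000          # example values only
batch_size, grad_accum = 4, 8

total_steps = epochs * dataset_size // (batch_size * grad_accum)  # 1875
for save_steps in (100, 1000):
    print(save_steps, total_steps // save_steps)  # 100 -> 18 checkpoints, 1000 -> 1
```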


Preview command shows the equivalent command that can be run from the command line. Output dir is the storage path for training results and checkpoints. After setting the parameters, click Start to begin training. Alternatively, copy the command and run it under /home/ubuntu/LLaMA-Factory in the llama_factory conda environment.
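If you prefer to launch the copied command from a script instead of an interactive shell, one possible approach (a sketch only; it assumes conda is on PATH, and the command string is whatever Preview command produced) is:

```python
# Run the command copied from "Preview command" inside the llama_factory
# conda environment, from the LLaMA-Factory directory. Replace the placeholder
# with the actual command string shown in the web UI.
import subprocess

copied_command = "<paste the command from Preview command here>"

subprocess.run(
    ["conda", "run", "-n", "llama_factory", "bash", "-c", copied_command],
    cwd="/home/ubuntu/LLaMA-Factory",
    check=True,
)
```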