funsloth-local

chrisvoncsefalvay/funsloth-local

AI & ML

About

SKILL.md

funsloth-local

chrisvoncsefalvay/funsloth-local

AI & ML

About

Training manager for local GPU training - validate CUDA, manage GPU selection, monitor progress, handle checkpoints

SKILL.md

Local GPU Training Manager

Run Unsloth training on your local GPU.

Prerequisites Check

1. Verify CUDA

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

If CUDA not available:

Check NVIDIA drivers: nvidia-smi
Check CUDA: nvcc --version
Reinstall PyTorch: pip install torch --index-url https://download.pytorch.org/whl/cu121

2. Check VRAM

See references/HARDWARE_GUIDE.md for requirements:

VRAM	Recommended Setup
8GB	7B, 4-bit, batch=1, LoRA r=8
12GB	7B, 4-bit, batch=2, LoRA r=16
16GB	7-13B, 4-bit, batch=2, LoRA r=16-32
24GB	7-14B, 4-bit, batch=4, LoRA r=32

3. Check Dependencies

pip install unsloth torch transformers trl peft datasets accelerate bitsandbytes

Docker Option

Use the official Unsloth Docker image for a pre-configured environment (supports all GPUs including Blackwell/50-series):

docker run -d \
  -e JUPYTER_PASSWORD="unsloth" \
  -p 8888:8888 \
  -v $(pwd)/work:/workspace/work \
  --gpus all \
  unsloth/unsloth

Access Jupyter at http://localhost:8888. Example notebooks are in /workspace/unsloth-notebooks/.

Environment variables:

JUPYTER_PASSWORD - Jupyter auth (default: unsloth)
JUPYTER_PORT - Port (default: 8888)
USER_PASSWORD - User/sudo password (default: unsloth)

Run Training

Option 1: Notebook

jupyter notebook notebooks/sft_template.ipynb

Option 2: Script

# Edit configuration in script, then run
python scripts/train_sft.py

GPU Selection (Multi-GPU)

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Use first GPU

Monitor Training

Terminal

# Watch GPU usage
watch -n 1 nvidia-smi

# Or use nvitop (more detailed)
pip install nvitop && nvitop

WandB (Optional)

export WANDB_API_KEY="your-key"
# Add report_to="wandb" in TrainingArguments

Troubleshooting

OOM Error

Try in order:

Reduce batch_size (to 1)
Increase gradient_accumulation
Reduce max_seq_length
Reduce LoRA rank
torch.cuda.empty_cache()

Loss Not Decreasing

Check learning rate (try higher or lower)
Verify chat template matches model
Inspect data format

Training Too Slow

Enable bf16 if supported
Use packing=True for short sequences
Reduce logging_steps

See references/TROUBLESHOOTING.md for more solutions.

Resume from Checkpoint

TrainingArguments(
    resume_from_checkpoint=True,  # Auto-find latest
    # Or: resume_from_checkpoint="outputs/checkpoint-500"
)

Save Model

Training script automatically saves:

outputs/lora_adapter/ - LoRA weights
outputs/merged_16bit/ - Merged model (optional)

Test Inference

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained("outputs/lora_adapter")
FastLanguageModel.for_inference(model)

messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))

Handoff

Offer funsloth-upload for Hub upload with model card.

Tips

Close other GPU apps before training
Monitor temps - keep under 85C
Use UPS for long runs
Save frequently with save_steps

Bundled Resources

notebooks/sft_template.ipynb - Notebook template
scripts/train_sft.py - Script template
references/HARDWARE_GUIDE.md - VRAM requirements
references/TROUBLESHOOTING.md - Common issues

About

SKILL.md

About

Training manager for local GPU training - validate CUDA, manage GPU selection, monitor progress, handle checkpoints

SKILL.md

Local GPU Training Manager

Run Unsloth training on your local GPU.

Prerequisites Check

1. Verify CUDA

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

If CUDA not available:

Check NVIDIA drivers: nvidia-smi
Check CUDA: nvcc --version
Reinstall PyTorch: pip install torch --index-url https://download.pytorch.org/whl/cu121

2. Check VRAM

See references/HARDWARE_GUIDE.md for requirements:

VRAM	Recommended Setup
8GB	7B, 4-bit, batch=1, LoRA r=8
12GB	7B, 4-bit, batch=2, LoRA r=16
16GB	7-13B, 4-bit, batch=2, LoRA r=16-32
24GB	7-14B, 4-bit, batch=4, LoRA r=32

3. Check Dependencies

pip install unsloth torch transformers trl peft datasets accelerate bitsandbytes

Docker Option

Use the official Unsloth Docker image for a pre-configured environment (supports all GPUs including Blackwell/50-series):

docker run -d \
  -e JUPYTER_PASSWORD="unsloth" \
  -p 8888:8888 \
  -v $(pwd)/work:/workspace/work \
  --gpus all \
  unsloth/unsloth

Access Jupyter at http://localhost:8888. Example notebooks are in /workspace/unsloth-notebooks/.

Environment variables:

JUPYTER_PASSWORD - Jupyter auth (default: unsloth)
JUPYTER_PORT - Port (default: 8888)
USER_PASSWORD - User/sudo password (default: unsloth)

Run Training

Option 1: Notebook

jupyter notebook notebooks/sft_template.ipynb

Option 2: Script

# Edit configuration in script, then run
python scripts/train_sft.py

GPU Selection (Multi-GPU)

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Use first GPU

Monitor Training

Terminal

# Watch GPU usage
watch -n 1 nvidia-smi

# Or use nvitop (more detailed)
pip install nvitop && nvitop

WandB (Optional)

export WANDB_API_KEY="your-key"
# Add report_to="wandb" in TrainingArguments

Troubleshooting

OOM Error

Try in order:

Reduce batch_size (to 1)
Increase gradient_accumulation
Reduce max_seq_length
Reduce LoRA rank
torch.cuda.empty_cache()

Loss Not Decreasing

Check learning rate (try higher or lower)
Verify chat template matches model
Inspect data format

Training Too Slow

Enable bf16 if supported
Use packing=True for short sequences
Reduce logging_steps

See references/TROUBLESHOOTING.md for more solutions.

Resume from Checkpoint

TrainingArguments(
    resume_from_checkpoint=True,  # Auto-find latest
    # Or: resume_from_checkpoint="outputs/checkpoint-500"
)

Save Model

Training script automatically saves:

outputs/lora_adapter/ - LoRA weights
outputs/merged_16bit/ - Merged model (optional)

Test Inference

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained("outputs/lora_adapter")
FastLanguageModel.for_inference(model)

messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))

Handoff

Offer funsloth-upload for Hub upload with model card.

Tips

Close other GPU apps before training
Monitor temps - keep under 85C
Use UPS for long runs
Save frequently with save_steps

Bundled Resources

notebooks/sft_template.ipynb - Notebook template
scripts/train_sft.py - Script template
references/HARDWARE_GUIDE.md - VRAM requirements
references/TROUBLESHOOTING.md - Common issues