
    ianbarber/strix-halo-setup
    AI & ML · 6 installs


    About

    Complete setup for AMD Strix Halo (Ryzen AI MAX+ 395) PyTorch environments.

    SKILL.md

    Strix Halo Setup

    Set up a new PyTorch project optimized for AMD Strix Halo (Ryzen AI MAX+ 395, gfx1151).

    When Claude Should Use This Skill

    This skill should be invoked when:

    • Setting up PyTorch on AMD Strix Halo (Ryzen AI MAX+ 395, gfx1151) hardware
    • User reports "HIP error: invalid device function" with PyTorch on AMD APU
    • Configuring environments for running LLMs on AMD integrated graphics
    • User mentions needing GTT memory configuration for ML workloads
    • Creating a new ML project specifically for Strix Halo hardware
    • User asks about running 30B parameter models on AMD Ryzen AI MAX+

    What This Skill Does

    1. Verifies system configuration (ROCm, GTT, user groups)
    2. Creates a conda environment with working PyTorch for gfx1151
    3. Sets up proper environment variables
    4. Creates test scripts to verify GPU functionality
    5. Provides a complete project template with best practices

    Critical Information

    PyTorch Installation: Official PyTorch wheels from pytorch.org DO NOT WORK with gfx1151. They detect the GPU but fail on compute with "HIP error: invalid device function". This skill installs community builds that actually work.

    ROCm Installation Note: For Strix Halo APUs, ROCm should be installed with --no-dkms flag to use the inbox kernel driver. If you have amdgpu-dkms installed, it may cause issues when upgrading kernels.

    Prerequisites Check

    Before running setup, verify the system with:

    ./scripts/verify_system.sh
    

    This checks:

    • ROCm installation (6.4.4+ or 7.x required)
    • User in render and video groups
    • GTT memory configuration
    • Python/Conda availability
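If the helper script is unavailable, the group-membership part can be approximated by hand. This is a minimal sketch of just that one check; verify_system.sh also inspects the ROCm install and GTT configuration, which is not reproduced here:

```shell
# Check membership in the groups that grant GPU device access on ROCm systems
for g in render video; do
  if id -nG | tr ' ' '\n' | grep -qx "$g"; then
    echo "OK: member of $g"
  else
    echo "MISSING: $g (fix: sudo usermod -aG $g \$USER, then log out and back in)"
  fi
done
```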

    ROCm 7.x Note: ROCm 7.2+ offers significant performance improvements (~2.5x in BF16 compute). However, hipBLASLt is not fully optimized for gfx1151 yet and falls back to hipBLAS.

    If any checks fail, see .claude/skills/strix-halo-setup/docs/TROUBLESHOOTING.md for detailed fix instructions.

    Setup Process

    Step 1: System Verification

    Run the verification script:

    cd .claude/skills/strix-halo-setup
    ./scripts/verify_system.sh
    

    Expected output:

    • ✓ AMD GPU detected
    • ✓ ROCm installed
    • ✓ User in render/video groups
    • ✓ GTT configured (or warning if not)

    If any issues are found, follow the script's instructions to fix them.

    Step 2: Determine Project Name and Backend

    Ask the user for:

    1. Project name: If not specified, use strix-ml-project
    2. Backend choice: PyTorch (training/custom code) or Vulkan (inference only)

    Use the AskUserQuestion tool:

    Question 1: "What would you like to name your project?"
    Question 2: "Which backend do you want to set up?"

    • PyTorch with ROCm: For training, custom code, full ML framework (supports transformers, etc.)
    • Vulkan: For inference only (llama.cpp, Ollama) - simpler setup, often faster

    If PyTorch is chosen, continue with steps below. If Vulkan, skip to Vulkan setup section at the end.

    Step 3: Create Environment

    Using Conda (Recommended):

    # Create new environment with Python 3.14 (or 3.13)
    conda create -n {project_name} python=3.14 -y
    conda activate {project_name}
    

    Using uv (Alternative):

    # Create new environment with Python 3.14 (or 3.13)
    uv venv {project_name} --python 3.14
    source {project_name}/bin/activate
    

    Step 4: Install PyTorch (Community Build)

    CRITICAL: Must use community builds, not official wheels.

    Option 1: AMD Nightlies (Recommended)

    pip install --index-url https://rocm.nightlies.amd.com/v2/gfx1151/ --pre torch torchvision torchaudio
    

    As of January 2026, this provides PyTorch 2.11.0a0+ with ROCm 7.11.0 support.

    Option 2: TheRock Builds (Alternative)

    Pre-built wheels from the official ROCm TheRock project: https://github.com/ROCm/TheRock/releases

    Look for gfx1151 releases and install with pip install <wheel_file>.

    Verify Installation:

    python -c "import torch; print('PyTorch:', torch.__version__); print('HIP:', torch.version.hip)"
    

    Should show PyTorch 2.11+ and HIP 7.2+ with ROCm 7.x.
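A stronger check than printing versions is to look at which GPU architectures the wheel was actually compiled for. `torch.cuda.get_arch_list()` is a standard PyTorch API that also works on ROCm builds, where it lists gfx targets:

```shell
# List the compiled-in GPU targets; a working community build should include
# gfx1151, while official wheels omit it (which is why they fail at compute time)
python -c "import torch; print(torch.cuda.get_arch_list())"
```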

    Step 5: Configure Environment Variables

    Create activation script in the conda environment:

    mkdir -p $CONDA_PREFIX/etc/conda/activate.d
    
    cat > $CONDA_PREFIX/etc/conda/activate.d/strix_halo_env.sh << 'EOF'
    #!/bin/bash
    
    # Core ROCm settings for Strix Halo (gfx1151)
    export HSA_OVERRIDE_GFX_VERSION=11.5.1
    export PYTORCH_ROCM_ARCH=gfx1151
    
    # Unified Memory Configuration - CRITICAL for accessing full memory
    export HSA_XNACK=1
    export HSA_FORCE_FINE_GRAIN_PCIE=1
    
    # Memory allocation settings
    export GPU_MAX_HEAP_SIZE=100
    export GPU_MAX_ALLOC_PERCENT=100
    
    # Device visibility
    export ROCR_VISIBLE_DEVICES=0
    export HIP_VISIBLE_DEVICES=0
    
    # Performance optimizations
    export ROCBLAS_USE_HIPBLASLT=1
    export AMD_LOG_LEVEL=0
    export HSA_CU_MASK=0xffffffffffffffff
    
    # ROCm 7.x stability fixes for APUs
    export HSA_ENABLE_SDMA=0  # Prevents checkerboard artifacts in VAE decodes
    
    # PyTorch memory management (recommended for 32GB+ workloads)
    export PYTORCH_HIP_ALLOC_CONF="backend:native,expandable_segments:True,garbage_collection_threshold:0.9"
    
    echo "✓ Strix Halo environment variables set"
    EOF
    
    chmod +x $CONDA_PREFIX/etc/conda/activate.d/strix_halo_env.sh
    

    Create deactivation script:

    mkdir -p $CONDA_PREFIX/etc/conda/deactivate.d
    
    cat > $CONDA_PREFIX/etc/conda/deactivate.d/strix_halo_env.sh << 'EOF'
    #!/bin/bash
    
    unset HSA_OVERRIDE_GFX_VERSION PYTORCH_ROCM_ARCH HSA_XNACK HSA_FORCE_FINE_GRAIN_PCIE
    unset GPU_MAX_HEAP_SIZE GPU_MAX_ALLOC_PERCENT ROCR_VISIBLE_DEVICES HIP_VISIBLE_DEVICES
    unset ROCBLAS_USE_HIPBLASLT AMD_LOG_LEVEL HSA_CU_MASK HSA_ENABLE_SDMA PYTORCH_HIP_ALLOC_CONF
    EOF
    
    chmod +x $CONDA_PREFIX/etc/conda/deactivate.d/strix_halo_env.sh
    

    Step 6: Create Project Structure

    mkdir -p {project_name}/{scripts,notebooks,data,models,tests}
    cd {project_name}
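The brace expansion creates all five subdirectories in one call (it requires bash or zsh, not plain sh). Running the same pattern against a scratch path shows the layout it produces:

```shell
# Same mkdir pattern, pointed at a throwaway directory for illustration
proj=/tmp/strix-demo
mkdir -p "$proj"/{scripts,notebooks,data,models,tests}
ls "$proj"
# → data  models  notebooks  scripts  tests
```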
    

    Step 7: Copy Test Scripts

    Copy the test scripts from the skill directory:

    cp .claude/skills/strix-halo-setup/scripts/*.py scripts/
    chmod +x scripts/*.py
    

    Step 8: Create Project README

    Create a README with project-specific information:

    cat > README.md << 'EOF'
    # {Project Name}

    PyTorch project optimized for AMD Strix Halo (gfx1151).

    ## Environment

    - **Hardware**: AMD Strix Halo (gfx1151)
    - **ROCm**: 6.4.4+ or 7.x
    - **PyTorch**: Community build for gfx1151
    - **Python**: 3.14 (or 3.13)

    ## Setup

    ```bash
    # Activate environment
    conda activate {project_name}

    # Verify GPU
    python scripts/test_gpu_simple.py

    # Test memory capacity
    python scripts/test_memory.py
    ```

    ## Hardware Capabilities

    - Compute: ~7 TFLOPS FP32, ~31 TFLOPS BF16 (with ROCm 7.x)
    - Memory: Up to 113 GB GPU-accessible (with GTT configuration)
    - Model capacity: 30B parameter models in FP16

    ## Best Practices

    1. Use BF16 for a ~1.6x speedup over FP32
    2. Keep batch size small (1-4) for inference
    3. Data in VRAM is faster than GTT memory
    4. Monitor memory: `rocm-smi --showmeminfo gtt`

    ## Troubleshooting

    If compute fails with "HIP error: invalid device function":

    - You are using official PyTorch wheels (they do not work with gfx1151)
    - Reinstall: `pip install --index-url https://rocm.nightlies.amd.com/v2/gfx1151/ --pre torch`

    For more help, see .claude/skills/strix-halo-setup/docs/TROUBLESHOOTING.md

    Created: {date}
    EOF
    

    Step 9: Verify Installation

    Reactivate the environment so the variables load:

    conda deactivate
    conda activate {project_name}
    # Should see: "✓ Strix Halo environment variables set"
    

    Run verification:

    python scripts/test_gpu_simple.py
    

    Expected output:

    ============================================================
    STRIX HALO GPU TEST
    ============================================================
    ✓ GPU detected: AMD Radeon Graphics
      Memory: 113.2 GB
      Compute test successful
    ✓ ALL TESTS PASSED
    ============================================================
    

    Step 10: Final Summary

    Tell the user:

    ✓ Setup complete! Your Strix Halo environment is ready.
    
    Project: {project_name}
    Location: {full_path}
    
    Next steps:
      1. Test GPU: python scripts/test_gpu_simple.py
      2. Test memory: python scripts/test_memory.py
    
    Hardware capabilities:
      - 7 TFLOPS FP32 / 31 TFLOPS BF16 (ROCm 7.x)
      - 113 GB GPU-accessible memory
      - Can run 30B parameter models in FP16
    
    Activate anytime with: conda activate {project_name}
    

    Success Criteria

    All of these should pass:

    • ✓ PyTorch detects GPU
    • ✓ Compute operations succeed (no HIP errors)
    • ✓ Can allocate 30GB+ memory
    • ✓ BF16 operations work
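The 30GB+ allocation and 30B-model criteria follow from simple arithmetic: FP16 stores 2 bytes per parameter, so a 30B-parameter model needs about 60 GB for the weights alone, which fits under the ~113 GB GTT ceiling with room left for KV cache and activations:

```shell
# Back-of-envelope FP16 weight footprint for a 30B-parameter model
python3 -c 'print(f"{30e9 * 2 / 1e9:.0f} GB")'
# → 60 GB
```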

    Common Issues

    Issue: "HIP error: invalid device function"

    Cause: Using official PyTorch wheels (they do not work with gfx1151)

    Solution:

    pip uninstall torch torchvision torchaudio
    pip install --index-url https://rocm.nightlies.amd.com/v2/gfx1151/ --pre torch torchvision torchaudio
    

    Verify it worked:

    python -c "import torch; a=torch.tensor([1.0]).cuda(); print('✓ Works:', (a+1).item())"
    

    Issue: Out of memory below 30GB

    Cause: GTT not configured (limited to ~33GB)

    Solution 1: Upgrade to kernel 6.16.9+ (no configuration needed)

    Solution 2: For older kernels, configure GTT:

    .claude/skills/strix-halo-setup/scripts/configure_gtt.sh
    

    This adds kernel parameters to GRUB for GPU to access more system RAM.
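For orientation only: the knobs involved are the amdgpu and ttm kernel module parameters shown below, but the values here are illustrative (sized for roughly 108 GiB on a 128 GB machine) and may differ from what configure_gtt.sh actually writes — prefer the script to hand-editing GRUB:

```shell
# Illustrative /etc/default/grub fragment -- example values, not prescriptions
# amdgpu.gttsize is in MiB; ttm.pages_limit counts 4 KiB pages (both ~108 GiB here)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.gttsize=110592 ttm.pages_limit=28311552"
# Apply with: sudo update-grub && sudo reboot
```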

    Issue: GPU not detected

    Cause: User not in render/video groups

    Solution:

    sudo usermod -aG render,video $USER
    # Log out and back in (or reboot)
    groups | grep -E "render|video"  # Verify
    

    Vulkan Setup (Alternative to PyTorch)

    If the user chose Vulkan for inference-only workloads:

    Step V1: Install Vulkan Drivers

    sudo apt install mesa-vulkan-drivers vulkan-tools
    

    Step V2: Verify Vulkan

    vulkaninfo | grep "deviceName"
    # Should show: AMD Radeon Graphics or similar
    

    Step V3: Install Inference Tools

    For llama.cpp:

    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    cmake -B build -DGGML_VULKAN=ON
    cmake --build build --config Release
    

    For Ollama:

    curl -fsSL https://ollama.com/install.sh | sh
    

    Step V4: Test Vulkan

    # With llama.cpp (a Vulkan build offloads to the GPU automatically via -ngl)
    ./build/bin/llama-cli -m /path/to/model.gguf -ngl 99
    
    # With Ollama
    ollama run llama2
    

    Vulkan Summary

    Tell the user:

    ✓ Vulkan setup complete!
    
    Backend: Vulkan (inference only)
    Use with: llama.cpp, Ollama, other Vulkan-enabled tools
    
    Note: Vulkan often provides better performance for inference than ROCm/HIP.
    For training or custom PyTorch code, set up PyTorch instead.
    

    References

    • Troubleshooting: .claude/skills/strix-halo-setup/docs/TROUBLESHOOTING.md
    • GTT Configuration: .claude/skills/strix-halo-setup/docs/GTT_MEMORY_FIX.md
    • Community PyTorch: https://github.com/ROCm/TheRock/releases

    Notes

    • GTT configuration needed for 30B+ models on kernels before 6.16.9 (kernel 6.16.9+ has automatic UMA support)
    • Vulkan backend often provides better performance for inference
    • Use BF16 precision in PyTorch for better performance

    ROCm 7.x Considerations

    Benefits of ROCm 7.2+:

    • Up to 5x performance improvement in image generation (ComfyUI, Flux, SDXL)
    • ~2.5x improvement in BF16 compute (31 TFLOPS vs 12 TFLOPS on ROCm 6.x)
    • Unified Windows/Linux release
    • Better long-term support path
    • JAX 0.8.0 support

    Known Limitations on gfx1151 (as of ROCm 7.2):

    • hipBLASLt: Falls back to hipBLAS (slower) due to unsupported architecture
    • LLM Decode: Memory-copy bound (~92-95% time in hipMemcpyWithStream), limiting token throughput
    • Flash Attention: May require TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 and still has issues

    Deprecations in ROCm 7.2:

    • HIPCC is deprecated; AMD Clang should be used directly for compilation
    • ROCTracer, ROCProfiler, rocprof, rocprofv2 are deprecated; use ROCprofiler-SDK and AMD SMI instead

    Recommended Environment Variables for ROCm 7.x:

    export HSA_ENABLE_SDMA=0  # Prevents artifacts in VAE decodes
    export PYTORCH_HIP_ALLOC_CONF="backend:native,expandable_segments:True,garbage_collection_threshold:0.9"
    
    Repository: ianbarber/strix-halo-skills