AI Toolkit and Rocky Linux 9 and RTX 6000 Pro Blackwell

Yay I got it working. Boo I wish I had the last 5 days of my life back.

My pain = your gain.

Lame on, Lame on.

Lenovo P8 79765wx

NVIDIA-SMI 570.86.10 Driver Version: 570.86.10 CUDA Version: 12.8

RTX 6000 Pro Blackwell

# AI Toolkit / PyTorch Blackwell Installation Guide

## 🧾 Summary of What Happened

### Symptoms
```
ImportError: /usr/local/cuda/lib64/libcusparse.so.12: undefined symbol: __nvJitLinkCreate_12_8
```
Torch was loading mismatched CUDA system libraries. We resolved this by ensuring all CUDA libraries came from the venv’s own NVIDIA Python packages.
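
A quick way to confirm the venv is supplying its own CUDA libraries (a hedged check; the `nvidia-*-cu12` names below are the usual wheels pulled in by the cu128 torch build):
```bash
source ~/venvs/ai-toolkit/bin/activate
pip list | grep -i nvidia
# expect entries like nvidia-cublas-cu12, nvidia-cusparse-cu12, nvidia-nvjitlink-cu12
```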

### Final Verified Setup
```
torch: 2.8.0+cu128 | CUDA: 12.8
GPU: NVIDIA Graphics Device (CC 12.0)
✅ Matmul OK
```

---

## 🧠 Root Cause
Multiple CUDA versions and dangling symlinks in `/usr/local/cuda-*` conflicted. PyTorch’s JIT linker (`libnvJitLink.so.12`) was missing or mismatched. Installing `nvidia-nvjitlink-cu12` and cleaning `ld.so.conf.d` fixed the issue.
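
A minimal sketch of that fix, assuming the venv from step 2 below (the `.conf` file name is a placeholder; inspect before deleting anything):
```bash
# Pull the JIT linker from PyPI so torch resolves it inside the venv
pip install nvidia-nvjitlink-cu12

# Find stale CUDA entries in the dynamic-linker config
grep -rl cuda /etc/ld.so.conf.d/
# sudo rm /etc/ld.so.conf.d/cuda-<old-version>.conf   # only if it points at a removed toolkit
sudo ldconfig
```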

---

## 🧰 Reproducible Installation Procedure

### 0️⃣ Prereqs
- OS: Rocky Linux 9 x86_64 (Rocky 8 also works with the matching `rhel8` CUDA repo)  
- GPU: Blackwell (GB202 / RTX 5090 / B100 / etc.)  
- NVIDIA Driver: **570.86.10** or newer

---

### 1️⃣ Install NVIDIA Driver + CUDA Toolkit

```bash
sudo dnf remove -y 'cuda*' 'nvidia-driver*'
sudo dnf clean all
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf module install -y nvidia-driver:570-dkms   # stream name may vary; list with: dnf module list nvidia-driver
sudo dnf install -y cuda-toolkit-12-8
sudo reboot
```

Verify:
```bash
nvidia-smi
# Should show CUDA Version: 12.8 and Driver Version: 570.86.10+
```
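
Since stale linker config caused the original breakage, a quick extra sanity pass is worthwhile (paths below are the defaults; adjust if your install differs):
```bash
nvcc --version            # should report release 12.8 (add /usr/local/cuda/bin to PATH if not found)
ls -l /usr/local/cuda     # symlink should point at /usr/local/cuda-12.8
grep -r cuda /etc/ld.so.conf.d/ || true   # watch for leftovers from older toolkits
```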

---

### 2️⃣ Create a Python Environment

```bash
python3 -m venv ~/venvs/ai-toolkit
source ~/venvs/ai-toolkit/bin/activate
pip install --upgrade pip setuptools wheel
```

---

### 3️⃣ Install PyTorch CUDA 12.8 Build

```bash
pip install torch==2.8.0+cu128 torchvision==0.23.0+cu128 torchaudio==2.8.0+cu128 \
  --extra-index-url https://download.pytorch.org/whl/cu128
```
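
A one-liner to confirm pip actually resolved the cu128 wheel rather than a CPU-only build:
```bash
python -c "import torch; v = torch.__version__; assert '+cu128' in v, v; print(v)"
```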

---

### 4️⃣ Verify CUDA Runtime Alignment

```bash
python - <<'PY'
import torch, torch.cuda as cu
print("Torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("GPU:", cu.get_device_name(0))
x = torch.randn(1024,1024,device='cuda'); y = torch.randn(1024,1024,device='cuda')
z = x@y; cu.synchronize(); print("✅ Matmul OK:", z.shape)
PY
```
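
On Blackwell it's also worth confirming the compute capability the wheel sees; per the summary above this card reports CC 12.0, and a wheel built without `sm_120` support will typically warn at this point:
```bash
python -c "import torch; print('Compute capability:', torch.cuda.get_device_capability(0))"
# expected on this card: (12, 0)
```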

---

### 5️⃣ Confirm NVIDIA Library Resolution

```bash
python - <<'PY'
import torch._C as C, os
os.system(f"ldd {C.__file__} | egrep 'nvJitLink|cusparse|cublas|cudart' || true")
PY
```
Expected paths:
```
.../venv/lib64/python3.12/site-packages/nvidia/...
```
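
If any of those resolve to `/usr/local/cuda/...` instead, a common culprit (consistent with the root cause above) is a shell profile forcing the system CUDA onto the library path:
```bash
echo "$LD_LIBRARY_PATH"   # should not contain /usr/local/cuda/lib64
unset LD_LIBRARY_PATH     # then re-run the ldd check above
```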

---

### 6️⃣ Optional — GPU Cooling Automation (X11 only)

```bash
sudo tee /etc/X11/xorg.conf.d/20-nvidia-coolbits.conf >/dev/null <<'EOF'
Section "Device"
    Identifier "NVIDIA0"
    Driver     "nvidia"
    Option     "Coolbits" "28"
    Option     "AllowEmptyInitialConfiguration" "True"
EndSection
EOF

sudo systemctl restart display-manager
```
Then, inside your X session:
```bash
nvidia-settings -a "[gpu:0]/GPUFanControlState=1" -a "[fan:0]/GPUTargetFanSpeed=85"
```
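You can confirm the assignments took by querying the same attributes:
```bash
nvidia-settings -q "[gpu:0]/GPUFanControlState" -q "[fan:0]/GPUTargetFanSpeed"
```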

---

### 7️⃣ Optional — Background Launcher for AI Toolkit UI

Create `/usr/local/bin/run-ai-toolkit`:
```bash
#!/bin/bash
cd /scratch/workspace/GitHub/ai-toolkit/ui || exit 1
nohup npm run build_and_start >/tmp/ai-toolkit-ui.log 2>&1 &
disown
echo "🚀 AI Toolkit UI started in background (log: /tmp/ai-toolkit-ui.log)"
```
Make executable:
```bash
sudo chmod +x /usr/local/bin/run-ai-toolkit
```
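Usage is then simply:
```bash
run-ai-toolkit
tail -f /tmp/ai-toolkit-ui.log   # follow the build/start output
```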

---

## ✅ Final Checklist

| Step | Goal | Command |
|------|------|----------|
| Driver | Confirm 570+ | `nvidia-smi` |
| CUDA Toolkit | 12.8 | `nvcc --version` |
| PyTorch | 2.8.0+cu128 | install via pip |
| Verify | Run matmul | ✅ |
| GPU Fans | optional | X11 + Coolbits |
| Launch UI | run-ai-toolkit | runs in background |

---

## 📦 Optional Next Step
If you'd like, you can automate the entire setup with a single `setup-ai-gpu.sh` installer script.
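
A minimal sketch of what that script could look like, chaining steps 1–3 above (same assumptions: Rocky 9 repo, 570 driver stream, cu128 wheels; review before running):
```bash
#!/bin/bash
set -euo pipefail

# Driver + CUDA toolkit (step 1)
sudo dnf config-manager --add-repo \
  https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf module install -y nvidia-driver:570-dkms   # stream name may vary
sudo dnf install -y cuda-toolkit-12-8

# Python venv (step 2)
python3 -m venv ~/venvs/ai-toolkit
source ~/venvs/ai-toolkit/bin/activate
pip install --upgrade pip setuptools wheel

# PyTorch cu128 build (step 3)
pip install torch==2.8.0+cu128 torchvision==0.23.0+cu128 torchaudio==2.8.0+cu128 \
  --extra-index-url https://download.pytorch.org/whl/cu128

echo "Reboot, then run the verification steps above."
```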

One thing I do when fucking around with things like this is run a VM via Proxmox and pass the GPU through. Then you can take snapshots of the VM along the way, and just roll back when you break it. Really easy to test and optimize things without having to sit there and install a fresh OS each time. Teradici works fine within that environment too.

But I like headaches, splinters, flicking myself on the inner thigh, stepping on legos, and not going outside.

dont you go a-changing @randy

So how do you take the system you tested and perfected in a VM on Proxmox and implement it on a standalone box somewhere as your Flame? How do you save that VM and install it as-is on a separate box?

@theoman - if your vm is the adsk rocky os and you keep track of what you installed and configured, it's fairly simple to recreate on bare metal.

one of, if not the first thing that i do these days is:

```bash
echo 'export HISTTIMEFORMAT="%F %T "' >> ~/.bashrc \
&& echo 'export HISTSIZE=10000 HISTFILESIZE=10000' >> ~/.bashrc \
&& source ~/.bashrc
```

you can try much more exotic commands to log your commands, your prompts and stdout and stderr.
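
for example, the classic `script` utility records an entire terminal session (commands plus stdout and stderr) into a log file; the path below is just an example:
```bash
mkdir -p ~/logs
script -a ~/logs/session-$(date +%F).log
# ...work as normal, then type 'exit' to stop recording
```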

these days you can pipe the results of all of that into a large language model and request shell or python scripts to recreate your steps, with logs and confirmations.

when you’re happy with that you can convert everything to ansible.

Woah. ok that is totally above my head rn until i actually do it and find out how it works. The next time i need to test things out on a new build i'll look into that. thanks phil

@theoman - if you are running rocky linux right now you can open a terminal, and as long as you have not changed your default shell, you can type `history` and it will list the commands that you have used.

the command in the above post adds a date/timestamp to your history so you can isolate when you ran a command, and it raises the default history size to the last 10,000 commands.

search the internet for bash history and there are plenty of guides, or better yet, start a conversation with a large language model:

> "explain bash shell history to a novice"

you can also ask a large language model:

> "explain macOS zsh shell history to a novice"

> "help me design a program and a shell command that will record my commands and the stdout and stderr into a log file

oh awesome. so it's saving them all the time, and then you can copy and paste, weed out the ones you don't want, and create scripts to automate them on another machine. When i get some time i'll look into this. I'm deep in IT hell figuring out why my 25GbE NIC is not playing nicely with my proxmox. Simply crazy how one issue leads to another leads to another. The NIC won't come up but it's visible, the drivers seem to be the latest but might need to be wound back, getting an ISO or a tar onto the local proxmox is easy but then i'm getting `cant open block dev` errors when trying to mount said ISO. i mean, it's endless sometimes! i will persevere!

if you work out logging then you can just feed all the gibberish to a machine and it will help you work out how to not fuck things up in the future, hilariously enough, most likely with a virtual machine.

we no longer live in a world where engineers or it administrators tell you what you can and cannot do based on their ignorance of flame.

it's time to remind them of the fact that the picture makers make the money, not the folder makers…

`mkdir -p -m 777 My_special_projekt` is not something that clients want to pay for…

I would love to build a VM on my proxmox cluster but I am so jammed up with time I'm kinda building it as I fly it.

Once I retire an olde P620 I’ll chuck an old gpu in it, pass it through, and add it to my PVE cluster. Until then, I suffer like a fool.

@randy
the script will probably look something like:

```bash
#!/bin/bash

# Define VM parameters
VMID=101
VMNAME="gpu-vm"
ISO_PATH="local:iso/Rocky-9.5-x86_64-dvd-ADSK_REV_001.iso"  # ISO uploaded to Proxmox ISO storage
DISK_PATH="local-lvm"
GPU_PCI="0000:01:00.0"    # Replace with your GPU PCI ID
GPU_AUDIO="0000:01:00.1"  # Optional: GPU audio device
CORES=32
MEMORY=131072

# Create VM (cloud-init drive goes on ide0 so ide2 stays free for the ISO,
# since --cdrom below is an alias for --ide2)
qm create $VMID \
  --name $VMNAME \
  --memory $MEMORY \
  --cores $CORES \
  --net0 virtio,bridge=vmbr0 \
  --ostype l26 \
  --boot order=ide2 \
  --ide0 $DISK_PATH:cloudinit \
  --scsihw virtio-scsi-pci \
  --scsi0 $DISK_PATH:32 \
  --serial0 socket \
  --vga none \
  --machine q35 \
  --cpu host

# Attach ISO
qm set $VMID --cdrom $ISO_PATH

# Enable UEFI boot
qm set $VMID --bios ovmf
qm set $VMID --efidisk0 $DISK_PATH:1,format=raw

# Add GPU passthrough
qm set $VMID --hostpci0 $GPU_PCI,pcie=1,x-vga=1
qm set $VMID --hostpci1 $GPU_AUDIO,pcie=1  # Optional

# Start VM
qm start $VMID

echo "VM $VMNAME with GPU passthrough created and started."
```
obviously, the nice thing is that once you get the formula correct for the vm you can create a template.
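
for reference, turning a stopped vm into a template and cloning from it looks something like this (vmids are just examples):
```bash
qm stop 101       # templates must be created from a stopped VM
qm template 101   # convert VM 101 into a template
qm clone 101 102 --name gpu-vm-02 --full   # full clone for the next build
```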