htdocs/notes/AMD-AGI.md

Here is how to get popular AI software running on a recent RDNA or RDNA2 AMD card. This method uses distrobox so it should be the same for every Linux distribution. 

There is always a chance these instructions would be deprecated in a few months.

## Requirements

### Hardware

All RDNA iterations should work possibly with slightly different steps and versions. Polaris and Vega are also known to work but some parameters might be different. I can only test these results on my RX 6800m.

### Software

The main advantage AMD has over Nvidia on Linux is the mainline open source driver working out of the box. Installing amdgpu-pro defats the purpose so you might as well buy Nvidia if you are to use binary blobs.

I recommend a recent kernel in order to make use of the drivers available on mainline kernel and avoid using the proprietary amdgpu-pro driver. The one included on Ubuntu 22.04 LTS should suffice which is around 5.4.14 but chances are you are on 6.x.

The first step is to install [distrobox](https://github.com/89luca89/distrobox). On Nixos I've achieved that by adding the distrobox package to my user list of packages. If distrobox is not available on your package manager follow the instructions on the project page to get it installed.

On Fedora the package [toolbox](https://docs.fedoraproject.org/en-US/fedora-silverblue/toolbox/) works in a similar fashion. In fact both programs run on top of Podman rootless containers.

This article has been created on a system running the current version of Nixos (23.11). It shouldn't matter what you are running as long as you have a proper kernel and can run Podman rootless containers.

### Access rights

Your user must have access to /dev/kfd and /dev/dri/* devices. Normally these belong to root user and render group. This is done by default on most desktop distributions but if you are not on the render group add yourself to it and restart your session.

Find out which group owns /dev/kfd:
```
$ ls -l /dev/kfd
crw-rw-rw- 1 root render 242, 0 Sep 13 15:35 /dev/kfd
```

Find out if I belong to group render.

```
$ groups
users wheel video render
```

I'm case I'm not, assuming my username is iru I can do this by:

```
$ gpasswd -a iru render
```

Then just exit desktop or ssh session and log in again. A reboot is not needed.

## Preparing environment and installing libraries

### Initializing distrobox

First I recommend creating a separate home for your distroboxes. The reason behind it being some compiler environment variables you might not want to mix with your main system. Here I'm going to use ~/Applications/rocm6.


```bash
mkdir -p ~/Applications/rocm6
```


You will be asked confirmation for the commands bellow. 

```
distrobox create -n rocm6 --home /home/iru/Applications/rocm6 --image docker.io/library/ubuntu:22.04
```

Now we can enter the distrobox container:

```
$ distrobox enter rocm6
Starting container rocm

...

Container Setup Complete!
```

You are now inside a rootless (user level, non administrative) container integrated with your home folder and system devices. You can access your files and folders but with no visibility to the world outside your home folder. The operating system is now Ubuntu 22.04 regardless of the underlying distribution.

* You can sudo and install programs into this toolbox using apt.
* The root superuser inside this environment is not the same as your system root user.
* Programs installed through the system do not affect your underlying system.
* Deleting the distrobox will have no affect on your system.
* Your home is still mapped therefore if you delete your personal files those will be gone for good.
* Programs installed locally through your user folder (i.e. Go, Rust stuff) might work as normal.
* Since we opted to set ~/Applications/rocm6 as the home folder the dot files will be separate from your main system. 

Regardless of how you've arrived here, you should have an Ubuntu 22.04 environment with access to /dev/kfd and /dev/dri/* on your system.

It is a good idea to update the system before we begin installing dependencies.
```
$ sudo apt update
$ sudo apt upgrade -y
```

### Installing ROCm

Now lets install ROCm and other packages by following the instructions on the [AMD documentation site for ROCM](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/native-install/ubuntu.html).


Import the signing keys into the distrobox container.
```
$ sudo mkdir --parents --mode=0755 /etc/apt/keyrings
$ wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
...
2024-05-07 14:03:18 (2.61 GB/s) - written to stdout [3116/3116]

```

Now we register the runtime repos.
```
$ echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/6.1/ubuntu jammy main" | sudo tee /etc/apt/sources.list.d/amdgpu.list
$ echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/6.1 jammy main" | sudo tee --append /etc/apt/sources.list.d/rocm.list
$ echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600'| sudo tee /etc/apt/preferences.d/rocm-pin-600
$ sudo apt update
```

Install ROCm. No need to install the proprietary drivers.
```
$ apt install rocm
```

Optionally you might also want to install radeontop a useful TUI resource monitor. 

```
$ sudo apt install radeontop
```

The rocm-smi monitor:

```
$ rocm-smi
========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
ERROR: GPU[1]	: sclk clock is unsupported
====================================================================================
GPU[1]		: get_power_cap, Not supported on the given system
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan     Perf  PwrCap       VRAM%  GPU%  
0    50.0c           22.0W   500Mhz  96Mhz    40.78%  auto  130.0W         0%   3%    
1    57.0c           31.0W   None    1600Mhz  0%      auto  Unsupported   77%   0%    
====================================================================================
=============================== End of ROCm SMI Log ================================
```

The rocminfo command:

```
$ rocminfo
```

Lots of devices here. I'm running one of those laptops with hybrid GPUs so it prints details about the integrated graphics unit on the Ryzen processor plus the discrete 6800m GPU. There is also a third virtual device that acts as a mux between discrete and integrated GPUs. 

In my case this is the relevant part:

```
*******                  
Agent 2                  
*******                  
  Name:                    gfx1031                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 6800M
```

From my personal experience APUs don't work.

Next we need to add some environment variables.

```
export CC=/opt/rocm/llvm/bin/clang
export CXX=/opt/rocm/llvm/bin/clang++
export PATH=/opt/rocm/bin:/opt/rocm/opencl/bin:$PATH
```

Another variable that needs to be set is `HSA_OVERRIDE_GFX_VERSION`. AMD doesn't actually support ROCm on consumer grade hardware so you need this variable to force it use a certain target. Remember rocminfo? It gave me a "gfx1031" for my 6800m so going to set this variable as "10.3.0" which would apply to 1030, 1031, 1032 and so on. For non RDNA cards we need to setup values for Vega (GCN5) which would be 9.0.0. RDNA3 uses "11.0.0".

```bash
export CC=/opt/rocm/llvm/bin/clang
export CXX=/opt/rocm/llvm/bin/clang++
export PATH=/opt/rocm/bin:/opt/rocm/opencl/bin:$PATH
export HSA_OVERRIDE_GFX_VERSION="10.3.0"
```

Finally source the shell configuration file or exit the session and log in again and verify the environment variables are properly set.

```
$ source ~/.profile
$ echo $CC
/opt/rocm/llvm/bin/clang
$ echo $CXX
/opt/rocm/llvm/bin/clang++
```

## Installing software

Following are instructions to install Stable Diffusion webui, ooba's text generation ui and Langchain.


### Git
```
apt install git python3.10 python3.10-venv
```

### Voice Changer

```
$ sudo apt install libportaudio2
$ cd
$ git clone https://github.com/w-okada/voice-changer.git
$ cd voice-changer
$ python3 -m venv venv
$ source venv/bin/activate
$ pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0
$ cd server
$ pip install -r requirements.txt
```

### Stable Diffusion WebUI

The most popular way to interact with Stable Diffusion is through AUTOMATIC1111's Gradio webui. Stable diffusion depends on Torch.

We start by cloning the Automatic1111 project into our home folder.

```
$ cd
$ git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git
$ cd stable-diffusion-webui
```

This project changes a lot so for your sanity I recommend switching to the last release (1.6.0 at the time of this writing).

```
$ git checkout v1.6.0 
```

On Nvidia or pure CPU you just need to run webui.sh and have all the software and dependencies installed AMD requires that you define an alternative source for the libraries besides setting certain environment variables.

The file we are looking for is webui-user.sh where custom configuration can be added.

The first one is `TORCH_COMMAND` which defines how Torch should be installed. By going to [https://pytorch.org](https://pytorch.org) and scrolling down til you see an "INSTALL PYTORCH" with a selection box.

* Choose Stable (at the time of this article is 2.0.1)
* Choose Linux
* Choose Pip
* Choose ROCm (at the time of this article is 5.4.2)

You will get something like this:

```
Run this Command:  pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2
```

The above command should also be enough if you just want to play with PyTorch.

Torch+ROCm doesn't necessarily needs to match the ROCm version you have installed. Of course matching versions would ensure best compatibility but it seems to not always be the case. I've tried the nightly Torch+ROCm 5.6 together with matching ROCm runtime version and it couldn't detect CUDA support.

So add this to webui-user.sh and make sure it's the only occurrence of `TORCH_COMMAND`. You probably don't need torchaudio.

```
export TORCH_COMMAND="pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.4.2"
```

An optional setting is telling Torch how to manage memory. Aparently using the CUDA runtime (in our case ROCm passing as CUDA) native memory management is more efficient.

```
export PYTORCH_CUDA_ALLOC_CONF="backend:cudaMallocAsync"
```

Now the Command-line args to the program. In my experience the `--precision full --no-half --opt-sub-quad-attention` arguments will offer the best compatibility for AMD cards.

```
export COMMANDLINE_ARGS="--precision full --no-half --opt-sub-quad-attention --deepdanbooru"
```

If your card supports fp16 you can replace `--precision full --no-half` with `--upcast-sampling` for better performance and more efficient memory management. This is the case for at least RDNA2.

Here is the list of possible arguments: [https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Command-Line-Arguments-and-Settings](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Command-Line-Arguments-and-Settings). A dedicated AMD session exists in [https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs) and might have up to date instructions.

So my webui-user.sh would look like this:

```
export HSA_OVERRIDE_GFX_VERSION="10.3.0"
export PYTORCH_CUDA_ALLOC_CONF="backend:cudaMallocAsync"
export COMMANDLINE_ARGS="--precision full --no-half --opt-sub-quad-attention --deepdanbooru"
export TORCH_COMMAND="pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.4.2"
```

A general tip is to copy that file somewhere else or add it to .gitignore. This file is tracked so if you run something like git reset --hard you might lose it.

Now run ./webui.sh.

```
$ ./webui.sh
################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye)
################################################################

################################################################
Running on iru user
################################################################

################################################################
Repo already cloned, using it as install directory
################################################################

################################################################
Create and activate python venv
################################################################

################################################################
Launching launch.py...
################################################################
Cannot locate TCMalloc (improves CPU memory usage)
Python 3.10.11 (main, Apr  5 2023, 00:00:00) [GCC 12.2.1 20221121 (Red Hat 12.2.1-4)]
Version: v1.6.0
Commit hash: 5ef669de080814067961f28357256e8fe27544f4
Installing torch and torchvision
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/rocm5.4.2
Collecting torch
  Downloading https://download.pytorch.org/whl/rocm5.4.2/torch-2.0.1%2Brocm5.4.2-cp310-cp310-linux_x86_64.whl (1536.4 MB)
```

Some installation mumbo jumbo, yellow font pip warnings and a bunch of large files downloaded later you'll see something like this:

```
Applying attention optimization: sub-quadratic... done.
Model loaded in 8.2s (calculate hash: 2.6s, load weights from disk: 0.1s, create model: 2.9s, apply weights to model: 2.1s, calculate empty prompt: 0.4s).
```

If you see some message recommending you to skip CUDA then something wrong happened, you are missing some library or don't have access to the /dev/kfd and /dev/dri/* devices. 

If everything is correct you should open your browser pointing it to http://localhost:7860 and start generating images. It needs to compile some stuff so the first image you try to generate might take a while to start. Subsequent gens will be faster.

If you encounter errors while starting or generating images related to image either adjust your parameters or set `--medvram` or `--lowvram` to `COMMANDLINE_ARGS` on webui-user.sh, at the cost of performance of course. 

OOM errors typically look like this:


```
    torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 5.06 GiB (GPU 0; 11.98 GiB total capacity; 7.99 GiB already allocated; 3.39 GiB free; 8.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF
```

It must be said that not even Nvidia users can generate high res images out of the box. The general strategy is to generate small images (like 512x512 or 768x768) and then use Hires Fix to upscale it.

#### SDXL

SDXL works like any model but bigger. If you get OOM errors while trying to generate images with SDXL add `--medvram-sdxl` to `COMMANDLINE_ARGS`. Again at a performance cost.

### Llama.cpp

Starting from commit 6bbc598 llama.cpp supports ROCm. Previous versions supported AMD through OpenCL. This project also changes and breaks very fast so stick to releases.

Make sure you've followed the step where we set the environment variables specially for the compilers.

```
$ cd
$ git clone https://github.com/ggerganov/llama.cpp.git
$ cd llama.cpp
$ git checkout b1266
```

Use cmake. GNU Make compiles a binary that segfaults while loading the models.

```
$ mkdir build
$ cd build
$ CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake .. -DLLAMA_HIPBLAS=ON
$ cmake --build .
$ cp main ..
```

The variables are not really needed since we already defined in the profile but I'm writing it anyway just to stress how they are needed.

If everything went alright you should see this message at the end.

```
====  Run ./main -h for help.  ====
```

That's it. Look for scripts under examples to find out how to use the program. You need to pass the `-ngl N` argument to load a number of layers into the video RAM. A convenient feature of llama.cpp is that you can offload some of the model data into your normal generic RAM, again at the cost of speed. So if you have plenty of vram+ram and are not in a hurry you can run a 70b model.

Current llama.cpp uses a format named GGUF. You can download open source models from hugging face. 
I'm going to use [CodeLlama-13B-Instruct-GGUF](https://huggingface.co/TheBloke/CodeLlama-13B-Instruct-GGUF). 

Check the quant method table under the model card. It tells how much memory that quantization uses. As a general rule I use Q4_K_M  or  Q5_K_M. Anything bellow Q4_K_M is often weird.

If you are sure the whole model fits in your VRAM (check the quant method table) just give some ridiculous value to -ngl like 1000. If that's not the case you need to make sure you pass enough layers to not OOM. I recommend to leave a bit of space on your VRAM for some swapping otherwise it will be painfully slow. So if you have 12GB VRAM make sure to pass enough layers to stay around 10GB and offload the rest to RAM.

```
$ curl -L -o ./models/codellama-13b-instruct.Q4_K_M.gguf https://huggingface.co/TheBloke/CodeLlama-13B-Instruct-GGUF/resolve/main/codellama-13b-instruct.Q4_K_M.gguf
$ ./examples/chat-13B.sh -ngl 1000 -m ./models/codellama-13b-instruct.Q4_K_M.gguf
```


A possible hiccup if you have more than one AMD GPU device (including the APU) you might get stuck in this:

```
Log start
main: warning: changing RoPE frequency base to 0 (default 10000.0)
main: warning: scaling RoPE frequency by 0 (default 1.0)
main: build = 1269 (51a7cf5)
main: built with cc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-4) for x86_64-redhat-linux
main: seed  = 1695477515
ggml_init_cublas: found 2 ROCm devices:
  Device 0: AMD Radeon RX 6800M, compute capability 10.3
  Device 1: AMD Radeon Graphics, compute capability 10.3
```

So we need to force it to use the proper device by passing `HIP_VISIBLE_DEVICES=N` to the script:

```
$ export HIP_VISIBLE_DEVICES=0
$ ./examples/chat-13B.sh -ngl 1000 -m ./models/codellama-13b-instruct.Q4_K_M.gguf
```

If that's the case then add `HIP_VISIBLE_DEVICES=0` to your shell profile.

If everything went fine you'll be presented with an interactive chat with a bot on your terminal. Pay attention to the messages before the prompt and look for this:

```
llm_load_tensors: VRAM used: 9014 MB
....................................................................................................
llama_new_context_with_model: kv self size  = 1600.00 MB
llama_new_context_with_model: compute buffer total size =  381.47 MB
llama_new_context_with_model: VRAM scratch buffer: 380.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
```

Important here is how much VRAM is used and if BLAS = 1. If BLAS =0 you might be running on slower CPU AVX. If you are offloading layers (as in offloading part of a model larger than your vram to your ram) adjust the number of layers to fit in your vram, again with some slack for swapping. 

If everything is correct you'll be able to make it output bullshit really fast.

```
 Text transcript of a never ending dialog, where USER interacts with an AI assistant named ChatLLaMa.
ChatLLaMa is helpful, kind, honest, friendly, good at writing and never fails to answer USER's requests immediately and with details and precision.
There are no annotations like (30 seconds passed...) or (to himself), just what USER and ChatLLaMa say aloud to each other.
The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long.
The transcript only includes text, it does not include markup like HTML and Markdown.

USER: Hello, ChatLLaMa!
ChatLLaMa: Hello USER! How may I help you today?
USER: What year is it?
ChatLLaMa: We are in 2023.
USER: Please tell me the largest city in Europe.
ChatLLaMa: The largest city in Europe is Moscow, the capital of Russia.
USER: What can you tell me about Moscow?
ChatLLaMa: Moscow, on the Moskva River in western Russia, is the nation's cosmopolitan capital. In its historic core is the Kremlin, a complex that's home to the president and tsarist treasures in the Armoury. Outside its walls is Red Square, Russia’s symbolic center.
USER: What is a cat?
ChatLLaMa: A cat is a domestic species of small carnivorous mammal. It is the only domesticated species in the family Felidae.
USER: How do I pass command line arguments to a Node.js program?
ChatLLaMa: The arguments are stored in process.argv.

    argv[0] is the path to the Node. js executable.
    argv[1] is the path to the script file.
    argv[2] is the first argument passed to the script.
    argv[3] is the second argument passed to the script and so on.
USER: Name a color.
ChatLLaMa: Blue.
USER: What time is it?
ChatLLaMa: It is 17:46.
USER: How to write hello world in ADA.
ChatLLaMa: To write "Hello World" in Ada, you need to create an empty project called "hello_world", then open your text editor (like Visual Studio Code or Sublime Text) and add the following code to it:
with Ada.Text_IO; use Ada.Text_IO; with Ada.Integer_Text_IO; use Ada.Integer_Text_IO; procedure HelloWorld is begin Put("Hello World"); NewLine; end HelloWorld; begin HelloWorld; end;

Then save the file as "hello_world.adb". Finally run the program by opening the terminal window and typing the command "gnatmake -P hello_world.gpr" and press enter key. Then wait for few seconds until the program finishes running and shows the output result on the screen.
```

### Oobabooga's text-generation-webui and llama-cpp-python

This is a Gradio UI for LLMs. It listens on the same port as stable diffusion webui.

```
$ cd
$ git clone https://github.com/oobabooga/text-generation-webui.git
$ cd text-generation-webui
$ python3 -m venv ooba
$ source ooba/bin/activate 
# for fish run "source ooba/bin/activate.fish" instead
$ pip install -r requirements.txt
```

We need to replace the llama-cpp-python lib with one that has ROCm support.

```
$ pip uninstall llama-cpp-python
$ CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install --no-cache llama-cpp-python
```

Place your model under the models folder and launch the program. As always make sure the environment variables are passed to the program.

```
HIP_VISIBLE_DEVICES=0 HSA_OVERRIDE_GFX_VERSION="10.3.0" python server.py
```

You might not need to pass the variables if they are in your profile but I'm stressing how they are necessary here.

Point your browser to http://localhost:7860 and load the desired model from the web interface. This one can OOM if you load too many layers so you might need to actually adjust the value.

You need to activate the venv before launching it. **Having the same venv for all the programs or not using venv at all is not a good idea. These projects often use specific versions and you might end up with a dependency mess if**.

### Langchain

This follows a similar path from ooba since it uses llama-cpp-python. 

```
$ python3 -m venv langchain
$ source langchain/bin/activate 
# for fish run "source langchain/bin/activate.fish" instead
$ CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install --no-cache llama-cpp-python
$ pip install langchain
```

Read [Langchain llamacpp doc](https://python.langchain.com/docs/integrations/llms/llamacpp) and start coding. Alternatively you can setup an OpenAI compatible endpoint with pure llama.cpp in server mode or llama-cpp-python.