I have a Framework Laptop 13 (Intel 13th gen) and an RTX 3070 sitting in a Thunderbolt 3 eGPU enclosure. On Windows it just works. On Linux, I got this:

NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:04:00.0)

The GPU was right there on the PCI bus. The driver loaded. And then it gave up because BAR1 — the 256MB framebuffer aperture the GPU needs to function — had zero bytes allocated. A 220W GPU reduced to a very expensive space heater.

I spent the better part of a weekend on this. Here is what I found.

The setup

  • Laptop: Framework 13, Intel i7-1370P (Raptor Lake), Thunderbolt 4
  • eGPU: RTX 3070 (GA104) in a Thunderbolt 3 enclosure (Intel Titan Ridge)
  • OS: Arch Linux, kernel 6.19, KDE Plasma 6.6 on Wayland
  • Monitor: 4K display plugged into the eGPU’s HDMI port

Important detail: I don’t always have the eGPU connected. Sometimes I work from the couch, sometimes from the desk. Whatever fix I came up with had to not break normal laptop usage without the GPU.

Reading the error (for once, actually reading it)

BAR1 is 0M is pretty explicit. The GPU’s Base Address Register 1 — a 256MB chunk of prefetchable memory that maps the framebuffer — was not allocated any address space at all. Let me check:

$ sudo lspci -s 04:00.0 -vv | grep -E 'Region|BAR'
Region 0: Memory at 7c000000 (32-bit, non-prefetchable) [size=16M]
Region 1: [virtual] (64-bit, prefetchable)              # ← gone
Region 3: Memory at 6010000000 (64-bit, prefetchable) [size=32M]

BAR0 and BAR3, fine. BAR1, just… not there. And dmesg confirms:

pci 0000:04:00.0: BAR 1 [mem size 0x10000000 64bit pref]: can't assign; no space

No space. On a machine with 64 gigs of RAM. Right.
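If you'd rather not squint at lspci, the same information lives in the device's sysfs resource file, which has one "start end flags" line per BAR. A tiny helper (my own, not part of any tool) turns a line into a size in MiB, printing 0 for an unassigned BAR:

```shell
# bar_size "START END FLAGS" -- size of a BAR in MiB, given one line of
# a sysfs resource file. An unassigned BAR is all zeros, so print 0.
bar_size() {
    set -- $1                      # split the line into start/end/flags
    if [ "$(($1))" -eq 0 ] && [ "$(($2))" -eq 0 ]; then
        echo 0
    else
        echo $(( ($2 - $1 + 1) / 1048576 ))
    fi
}

# BAR1 is the second line of the resource file, so on my machine:
#   bar_size "$(sed -n 2p /sys/bus/pci/devices/0000:04:00.0/resource)"
```

Bash arithmetic handles the 0x-prefixed hex directly, which is all the parsing this needs.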

Where did the address space go?

So PCI devices don’t get addresses directly — they inherit them from their parent bridge, which inherits from its parent, up to the root port. Every bridge has a “window” that constrains how much space it can hand to children.

I spent a while staring at lspci -vv on every bridge in the chain. Here is what the Thunderbolt topology looks like:

00:07.0 (Root Port, TB4)               pref window: 512MB
  └─ 02:00.0 (Upstream TB3 Switch)     pref window: 417MB
       ├─ 03:01.0 (Downstream → GPU)   pref window: 161MB  ← NOT ENOUGH
       │    ├─ 04:00.0  RTX 3070       needs: 256+32 = 288MB
       │    └─ 04:00.1  GPU Audio
       ├─ 03:02.0 (Downstream → USB)   USB controller
       └─ 03:04.0 (Downstream → nothing) pref window: 128MB  ← WASTED

There it is. The GPU needs 288MB (BAR1 + BAR3). Its bridge has only 161MB. And the daisy-chain port — 03:04.0, with absolutely nothing connected to it — is sitting on 128MB of prefetchable space doing nothing.

The kernel divides the upstream bridge’s window equally among the three downstream ports. Each gets roughly a third of 417MB. Mathematically fair, practically useless.

(If you’re planning to daisy-chain a second Thunderbolt device, that 128MB reservation makes sense. I’m not. I have one cable going to one box with one GPU in it.)
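Rather than inspecting each bridge by hand, one way to dump how the space was carved up is to loop over every bridge (PCI class 0604) and grep its prefetchable window. This assumes pciutils is installed; the exit-early guard just makes it safe to run anywhere:

```shell
#!/bin/bash
# Print the prefetchable memory window of every PCI bridge in the system.
command -v lspci >/dev/null || exit 0   # nothing to do without pciutils

for dev in $(lspci -d ::0604 | awk '{print $1}'); do
    # Bridge windows live in the first 64 bytes of config space, so
    # this works without root.
    win=$(lspci -s "$dev" -vv 2>/dev/null | grep 'Prefetchable memory behind')
    printf '%s %s\n' "$dev" "${win:-  (no prefetchable window)}"
done
```

Running this while staring at the topology above is how the 161MB vs 288MB mismatch jumps out.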

Trying all the kernel parameters

Before I understood the topology, I tried everything I could find on the Arch Wiki, the eGPU.io forums, random Reddit threads:

  • pci=realloc — the internet says this fixes BAR allocation. It does, but only during boot. Thunderbolt hotplug is a different code path and realloc does not apply there.
  • pci=hpmmioprefsize=2G — reserves more prefetchable space for hotplug bridges. Great idea, except the root port firmware window is only 512MB and this parameter cannot make it bigger. I tried 2G, 4G, 32G. Same result.
  • pci=hpmemsize=128M — same story but for non-prefetchable. Irrelevant here.

I also checked every BIOS setting on the Framework. There is no PCIe MMIO size option, no Resizable BAR toggle, no Thunderbolt memory allocation knob. Above 4G Decoding is enabled by firmware (you can tell because addresses are at 0x6000000000) but you cannot tweak it. Framework’s Insyde BIOS is not exactly a playground for PCI tuning.

At this point I was pretty frustrated. The GPU was right there, the driver was right there, and 128MB of address space was being wasted on an empty port that I will never use.

The remove-and-rescan trick

The actual fix, once I understood the problem, is crude but effective: remove the empty bridges from the PCI bus, remove the GPU bridge, then rescan from the root port. When the kernel rescans, it only sees one downstream bridge that needs prefetchable space, so it gives it everything. 288MB, exactly what the GPU needs.
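For reference, here is the hand-run version with the addresses from my topology (yours will differ). Run as root; the existence checks just make it a no-op if a device is already gone:

```shell
#!/bin/bash
# Manual remove+rescan, using the bridge addresses from my topology.
P=/sys/bus/pci/devices

# 1. Drop the empty daisy-chain bridge so its 128MB window is freed.
if [ -w "$P/0000:03:04.0/remove" ]; then
    echo 1 > "$P/0000:03:04.0/remove"
fi

# 2. Drop the GPU's own bridge (this takes the GPU down with it).
if [ -w "$P/0000:03:01.0/remove" ]; then
    echo 1 > "$P/0000:03:01.0/remove"
    sleep 1
fi

# 3. Rescan from the root port. The kernel redoes the window split,
#    and the GPU bridge is now the only one asking for prefetchable space.
if [ -w "$P/0000:00:07.0/rescan" ]; then
    echo 1 > "$P/0000:00:07.0/rescan"
fi
```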

I wrote a quick script, ran it manually, and… nvidia-smi worked. First time in two days.

Then I rebooted and it stopped working again.

The initramfs trap (the thing that cost me another half day)

My script was running, doing the remove+rescan, and the BAR allocation would succeed. But then something would reset everything about 16 seconds later. I could see it in dmesg:

[  0.87] pci 0000:04:00.0: BAR 1 [mem 0x6000000000-0x600fffffff 64bit pref]  ← OK!
[ 16.72] pci 0000:02:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 16.73] pci 0000:04:00.0: BAR 1 [mem size 0x10000000 64bit pref]: can't assign  ← GONE

Something at second 16 was re-enumerating the entire Thunderbolt bus. The bridge windows were getting reset to zero and the kernel was doing the allocation from scratch — badly, again.

I spent hours on this before I looked at my own mkinitcpio.conf:

MODULES=(thunderbolt)

I had put thunderbolt in the initramfs modules at some point (I honestly don’t remember why — probably something about the Thunderbolt dock I used to have). This caused the TB controller to initialize very early during boot, enumerate the PCI bus, allocate everything correctly. Then, 16 seconds later, the Thunderbolt hotplug system would kick in and do it all over again, resetting the bridge windows.

Two enumerations, one good, one bad. And my script was running either before, between, or after them depending on how fast systemd felt like starting things. A timing race I could not win.

The fix:

# /etc/mkinitcpio.conf
MODULES=()

Remove thunderbolt, rebuild initramfs with mkinitcpio -P, reboot. Now the TB module loads once via udev, the enumeration happens once, and my script has a predictable state to work with.
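To confirm the module really left the image, you can list the initramfs contents with lsinitcpio (it ships with mkinitcpio; the image path below is the Arch default):

```shell
#!/bin/bash
# Check whether the thunderbolt module is still baked into the initramfs.
if lsinitcpio /boot/initramfs-linux.img 2>/dev/null | grep -q thunderbolt; then
    echo "thunderbolt is still in the initramfs"
else
    echo "initramfs is clean (or image not found)"
fi
```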

I felt stupid for not checking this earlier. But honestly, who looks at their initramfs module list when debugging PCI BAR allocation?

The script and the service

With the double-enumeration out of the way, here is the full setup. First, blacklist nvidia so it doesn’t try to load before the fix runs (it would see BAR1 at zero and give up permanently):

# /etc/modprobe.d/nvidia-egpu.conf
blacklist nvidia
blacklist nvidia_drm
blacklist nvidia_modeset
blacklist nvidia_uvm

Then a systemd service that runs before SDDM:

# /etc/systemd/system/egpu-bar-fix.service
[Unit]
Description=Fix NVIDIA eGPU BAR1 allocation after Thunderbolt hotplug
After=bolt.service systemd-udevd.service
Before=display-manager.service
Wants=bolt.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/egpu-bar-fix.sh
RemainAfterExit=yes
TimeoutStartSec=90

[Install]
WantedBy=display-manager.service

And the script. It discovers everything dynamically — no hardcoded PCI addresses, so it should work on different hardware with different topologies:

#!/bin/bash
#
# egpu-bar-fix — Fix NVIDIA eGPU BAR1 allocation on Thunderbolt

set -euo pipefail

log() { logger -t egpu-bar-fix "$*"; echo "$*"; }
die() { log "FATAL: $*"; exit 1; }

find_nvidia_gpu() {
    local dir
    for dir in /sys/bus/pci/devices/*/; do
        [ "$(cat "$dir/vendor" 2>/dev/null)" = "0x10de" ] \
            && [ "$(cat "$dir/class" 2>/dev/null | cut -c1-6)" = "0x0300" ] \
            && basename "$dir" && return
    done
    return 1
}

pci_class() { cat "/sys/bus/pci/devices/$1/class" 2>/dev/null | cut -c1-6; }
is_bridge() { [ "$(pci_class "$1")" = "0x0604" ]; }

parent_bridge() {
    local real=$(readlink -f "/sys/bus/pci/devices/$1")
    local parent=$(basename "$(dirname "$real")")
    is_bridge "$parent" 2>/dev/null && echo "$parent"
}

sibling_bridges() {
    local real=$(readlink -f "/sys/bus/pci/devices/$1")
    local bdf child
    for child in "$(dirname "$real")"/0000:*; do
        bdf=$(basename "$child")
        [ "$bdf" != "$1" ] && is_bridge "$bdf" 2>/dev/null && echo "$bdf"
    done
}

bridge_is_empty() {
    local f
    for f in "/sys/bus/pci/devices/$1"/0000:*/class; do
        [ -f "$f" ] && [ "$(cat "$f" | cut -c1-6)" != "0x0604" ] && return 1
    done
    return 0
}

bar1_ok() {
    local bar1=$(sed -n '2p' "/sys/bus/pci/devices/$1/resource" 2>/dev/null)
    ! echo "$bar1" | grep -q '^0x0000000000000000'
}

# --- Main ---

log "looking for NVIDIA eGPU..."

gpu=$(find_nvidia_gpu) || gpu=""
if [ -z "$gpu" ]; then
    # Thunderbolt enumeration may still be settling; retry for a few seconds.
    # The || true keeps set -e from aborting when no GPU ever shows up.
    for _ in $(seq 1 5); do
        sleep 1
        gpu=$(find_nvidia_gpu) && break || true
    done
fi
[ -z "$gpu" ] && { log "no NVIDIA GPU found — nothing to do"; exit 0; }

log "found GPU at $gpu"

if bar1_ok "$gpu"; then
    log "BAR1 already assigned — loading driver"
    modprobe nvidia_drm modeset=1
    exit 0
fi

log "BAR1 is unassigned — fixing..."

gpu_bridge=$(parent_bridge "$gpu") || die "cannot find GPU parent bridge"

for sib in $(sibling_bridges "$gpu_bridge"); do
    bridge_is_empty "$sib" && {
        log "removing empty bridge $sib"
        echo 1 > "/sys/bus/pci/devices/$sib/remove" 2>/dev/null || true
    }
done

# Resolve the rest of the chain before removing anything; the sysfs
# links disappear along with the device, so look up the upstream
# bridge and root port first.
upstream=$(parent_bridge "$gpu_bridge") || die "cannot find upstream bridge"
root=$(parent_bridge "$upstream") || die "cannot find root port"

log "removing GPU bridge $gpu_bridge"
echo 1 > "/sys/bus/pci/devices/$gpu_bridge/remove"
sleep 1

log "rescanning from $root"
echo 1 > "/sys/bus/pci/devices/$root/rescan"
sleep 3

gpu=$(find_nvidia_gpu) || die "GPU not found after rescan"
bar1_ok "$gpu"           || die "BAR1 still unassigned after fix"

log "BAR1 OK for $gpu — loading driver"
modprobe nvidia_drm modeset=1
log "done"

When the eGPU is not connected, the script exits in about 5 seconds (“no NVIDIA GPU found”) and SDDM starts normally on the Intel GPU.

When connected, the full fix takes about 9 seconds — remove bridges, rescan, load nvidia — and then SDDM starts with the GPU ready.

One thing to be careful about: during the remove phase, do NOT remove the upstream bridge (02:00.0). I made this mistake early on and my external monitor went black instantly. The monitor uses Intel’s DisplayPort tunneling through the Thunderbolt chain, and removing the upstream bridge kills the tunnel. I had to hard-reboot. Only remove the GPU’s direct parent bridge and empty siblings.

Getting KDE to composite on the right GPU

So nvidia-smi works now, great. But the desktop still felt sluggish. Not terrible, but not the smoothness I expected from a 3070 driving a single 4K display.

I checked what KWin was actually using:

$ qdbus org.kde.KWin /KWin org.kde.KWin.supportInformation | grep renderer
OpenGL renderer string: Mesa Intel(R) Iris(R) Xe Graphics (RPL-P)

Intel. KWin was rendering every frame on the integrated GPU and then copying it to the NVIDIA card for display output. Every single 4K frame crossing the Thunderbolt link on its way to the screen. No wonder it felt off.

The fix is a small env script that KDE sources at login:

# ~/.config/plasma-workspace/env/kwin-gpu.sh
if [ -e /dev/dri/card0 ] && \
   [ "$(cat /sys/class/drm/card0/device/vendor 2>/dev/null)" = "0x10de" ]; then
    export KWIN_DRM_DEVICES=/dev/dri/card0:/dev/dri/card1
fi

KWIN_DRM_DEVICES tells KWin which GPU to use for rendering — the first device in the list wins. The if block checks that card0 is actually the NVIDIA card, so when the eGPU is disconnected and card0 is Intel, the variable is never set and KWin does its default thing.

After re-login:

OpenGL renderer string: NVIDIA GeForce RTX 3070/PCIe/SSE2

That’s more like it. 4K compositing on the discrete GPU, frames going straight from render to display, no unnecessary copies.
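Since the card0/card1 assignment depends on probe order (it is exactly why the env script has to check the vendor instead of assuming), here is a quick way to see the current mapping. Vendor 0x10de is NVIDIA, 0x8086 is Intel:

```shell
#!/bin/bash
# Print which PCI vendor owns each DRM card node.
for card in /sys/class/drm/card[0-9]; do
    [ -r "$card/device/vendor" ] || continue
    printf '%s -> %s\n' "$(basename "$card")" "$(cat "$card/device/vendor")"
done
exit 0
```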

Things I have not tested yet

I should be honest about what I don’t know:

  • Suspend/resume: I have no idea if the BAR allocation survives a suspend cycle. It almost certainly doesn’t. If your workflow involves closing the lid and reopening, you might need additional udev rules or a resume hook. I always shut down when I unplug the eGPU.
  • Hot-unplug: pulling the Thunderbolt cable while the GPU is active is a gamble on Linux. KWin handles it somewhat gracefully (falls back to Intel) but I wouldn’t rely on it. I stop KDE, unload nvidia, then disconnect.
  • Different enclosures: the script discovers the topology dynamically, so it should work with different TB enclosures and GPUs. But I’ve only tested my Titan Ridge + RTX 3070 setup.
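If you do need suspend, a systemd system-sleep hook is probably where I would start. This is a completely untested sketch, just the shape of it, that re-runs the BAR fix service after resume:

```shell
#!/bin/bash
# /usr/lib/systemd/system-sleep/egpu-bar-fix  (untested sketch)
# systemd invokes system-sleep hooks with "pre" or "post" as $1,
# plus the sleep type. After resume, redo the BAR fix.
case "$1" in
    post)
        systemctl restart egpu-bar-fix.service
        ;;
esac
```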

The files

Everything in one place:

File                                         What it does
-------------------------------------------  -----------------------------------------------
/etc/mkinitcpio.conf                         MODULES=() — removed thunderbolt
/etc/modprobe.d/nvidia-egpu.conf             Blacklists nvidia (loaded by the script instead)
/usr/local/bin/egpu-bar-fix.sh               The remove+rescan fix
/etc/systemd/system/egpu-bar-fix.service     Runs before SDDM
~/.config/plasma-workspace/env/kwin-gpu.sh   Tells KWin to render on NVIDIA
/boot/refind_linux.conf                      pci=realloc in kernel params

After creating these:

sudo mkinitcpio -P
sudo systemctl daemon-reload
sudo systemctl enable egpu-bar-fix.service

Is this a kernel bug?

Probably, yes. The hotplug code (pci_bus_distribute_available_resources() in drivers/pci/setup-bus.c, if you want to look) divides space equally among child bridges without checking if any of them actually have devices. It’s being conservative — a device could appear on any port later — but for the common eGPU case where only one port has anything connected, it wastes too much space.

There are discussions on the kernel mailing list about smarter resource allocation for hotplug, but Thunderbolt eGPUs are niche enough that nobody has pushed a fix upstream. So for now, a hundred lines of bash will have to do.


The script and config files are on GitHub. If you’re fighting the same problem or a variation of it, find me on LinkedIn — always happy to compare dmesg output with fellow eGPU masochists.