FFmpeg under Debian with NVIDIA NVDEC (CUVID) and NVENC support

A lesser-known feature of NVIDIA GPUs is hardware video transcoding via specialized on-GPU modules: NVENC and NVDEC. Debian’s FFmpeg package has supported these since at least Debian 12 (bookworm), but using them requires enabling the non-free repository and installing the proprietary NVIDIA drivers. For more details, see the official FFmpeg tutorial.

How to use NVENC/NVDEC

Hardware acceleration is not used by default; you must explicitly set the input decoder and the output encoder on the command line. To verify that your ffmpeg binary supports NVIDIA hardware acceleration, run:

ffmpeg -hide_banner -encoders | grep -i nvidia

This lists the supported encoders (that is, the formats the output file can be encoded in):

 V....D av1_nvenc            NVIDIA NVENC av1 encoder (codec av1)
 V....D h264_nvenc           NVIDIA NVENC H.264 encoder (codec h264)
 V....D hevc_nvenc           NVIDIA NVENC hevc encoder (codec hevc)

Currently, NVENC supports only AV1, H.264, and HEVC as output formats. NVDEC, however, accepts more formats as input:

ffmpeg -hide_banner -decoders | grep -i nvidia

returns:

 V..... av1_cuvid            Nvidia CUVID AV1 decoder (codec av1)
 V..... h264_cuvid           Nvidia CUVID H264 decoder (codec h264)
 V..... hevc_cuvid           Nvidia CUVID HEVC decoder (codec hevc)
 V..... mjpeg_cuvid          Nvidia CUVID MJPEG decoder (codec mjpeg)
 V..... mpeg1_cuvid          Nvidia CUVID MPEG1VIDEO decoder (codec mpeg1video)
 V..... mpeg2_cuvid          Nvidia CUVID MPEG2VIDEO decoder (codec mpeg2video)
 V..... mpeg4_cuvid          Nvidia CUVID MPEG4 decoder (codec mpeg4)
 V..... vc1_cuvid            Nvidia CUVID VC1 decoder (codec vc1)
 V..... vp8_cuvid            Nvidia CUVID VP8 decoder (codec vp8)
 V..... vp9_cuvid            Nvidia CUVID VP9 decoder (codec vp9)
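
For scripting, the same listings can be reduced to bare codec names. The following is a minimal sketch and not part of FFmpeg: the helper names nvidia_codec_names and has_nvidia_codec are made up for illustration, and the filter assumes the listing layout shown above, with the codec name in the second column.

```shell
# Print only the codec names (second column) of NVENC/CUVID entries.
# Reads the output of `ffmpeg -encoders` or `ffmpeg -decoders` on stdin.
nvidia_codec_names() {
  awk '$2 ~ /_(nvenc|cuvid)$/ { print $2 }'
}

# Succeed if the codec named in $1 appears in the listing on stdin.
has_nvidia_codec() {
  nvidia_codec_names | grep -qxF -- "$1"
}

# Example (needs an ffmpeg binary, so it is left commented out):
#   ffmpeg -hide_banner -decoders | has_nvidia_codec vp9_cuvid && echo "VP9 decodes on the GPU"
```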

Each encoder and decoder has specific options or presets; to view them:

ffmpeg -hide_banner -h encoder=hevc_nvenc
ffmpeg -hide_banner -h decoder=mpeg2_cuvid

Every encoder and decoder supports the -gpu option, which selects the GPU to use (0 by default). With nvidia-smi, you can see the available cards. The codec support matrix for different GPU models is available here.

Transcode with full hardware acceleration

To demonstrate this, I transcoded a sample MPEG2 file captured with a DVB-S tuner:

ffmpeg -y -hide_banner -hwaccel cuda -hwaccel_output_format cuda -hwaccel_device 0 -c:v mpeg2_cuvid -i input-file.ts -c:v h264_nvenc output.mp4

This is the fastest processing mode available. Decoding and encoding run in a single pipeline. The data is sent to the graphics card, decoded there, and stored in the GPU’s memory. The encoder then consumes it directly from GPU memory, and only the encoded stream is sent back to system memory. -hwaccel_device specifies which GPU to use. The -gpu options are ignored, since the entire process runs on a single card.

Transcode with independent NVDEC and NVENC

ffmpeg -y -hide_banner -c:v mpeg2_cuvid -gpu 0 -i input-file.ts -c:v h264_nvenc -gpu 0 output.mp4

In this mode, the encoder and decoder are independent; the decoded data goes back to main memory over PCIe. It is then sent to the GPU again for encoding. The obvious disadvantage is moving the data back and forth. However, if you have two cards, you can even use different GPUs for decoding and encoding:

ffmpeg -y -hide_banner -c:v mpeg2_cuvid -gpu 0 -i input-file.ts -c:v h264_nvenc -gpu 1 output.mp4

If the input format is not supported by NVDEC, FFmpeg can use another backend, usually on the CPU. In that case, the GPU handles only the encoding:

ffmpeg -y -hide_banner -i input-file.ts -c:v h264_nvenc -gpu 0 output.mp4
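
The choice between these modes can be scripted. The sketch below is an illustration under stated assumptions, not a definitive recipe: the function names are hypothetical, the codec names are the ones shown in parentheses in the decoder listing above, and the input codec is assumed to be known already. The commands are only printed, not executed, and file names containing spaces are not handled.

```shell
# Map a codec name to its CUVID decoder; fail if NVDEC has none for it.
# The names follow the "(codec ...)" column of the decoder listing.
cuvid_decoder_for() {
  case "$1" in
    av1|h264|hevc|mjpeg|mpeg4|vc1|vp8|vp9) echo "${1}_cuvid" ;;
    mpeg1video) echo mpeg1_cuvid ;;
    mpeg2video) echo mpeg2_cuvid ;;
    *) return 1 ;;
  esac
}

# Print a transcode command: full hardware acceleration when a CUVID
# decoder exists, CPU decoding plus NVENC otherwise.
# Usage: transcode_cmd INPUT_CODEC INPUT ENCODER OUTPUT
transcode_cmd() (
  if decoder=$(cuvid_decoder_for "$1"); then
    echo "ffmpeg -y -hide_banner -hwaccel cuda -hwaccel_output_format cuda -c:v $decoder -i $2 -c:v $3 $4"
  else
    echo "ffmpeg -y -hide_banner -i $2 -c:v $3 $4"
  fi
)

# The input codec can be read with ffprobe, for example:
#   ffprobe -v error -select_streams v:0 -show_entries stream=codec_name \
#     -of default=noprint_wrappers=1:nokey=1 input-file.ts
transcode_cmd mpeg2video input-file.ts h264_nvenc output.mp4
transcode_cmd prores input.mov hevc_nvenc output.mp4
```

The first call prints the full hwaccel command; the second falls back to CPU decoding, because the listing above contains no CUVID decoder for that codec.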

Testing performance

The machine I had access to is unusual, since the two powerful graphics cards are much newer than the rest of the hardware:

CPU: AMD Ryzen 9 3900X, 32 GB RAM
GPUs: 2× NVIDIA GeForce RTX 4090
FFmpeg 7.1.2, Debian 13

A single RTX 4090 supports PCIe 4.0 x16. In this setup, however, the cards are throttled by the limited number of available PCIe lanes, although this does not pose a problem for the machine’s main task, whisper.cpp speech-to-text processing. lspci -vv reports the following link status for the two cards:

Speed 16GT/s, Width x8 (downgraded)
Speed 16GT/s, Width x4 (downgraded)
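
To put these link states into perspective, the theoretical bandwidth can be estimated from the transfer rate and the lane count. The following is a rough sketch: pcie_bandwidth is a hypothetical helper, and the estimate only accounts for line encoding (8b/10b up to 5 GT/s, 128b/130b from 8 GT/s on), ignoring protocol overhead, so real throughput is lower.

```shell
# Read lines containing "Speed <rate>GT/s ... Width x<lanes>" on stdin and
# print the approximate one-directional bandwidth of each link.
pcie_bandwidth() {
  sed -n 's/.*Speed \([0-9.]*\)GT\/s.*Width x\([0-9]*\).*/\1 \2/p' |
    awk '{
      # Line-encoding efficiency: 8b/10b up to 5 GT/s, 128b/130b above.
      eff = ($1 <= 5) ? 8 / 10 : 128 / 130
      # GT/s times lanes times efficiency gives Gbit/s; divide by 8 for GB/s.
      printf "x%d at %s GT/s: about %.2f GB/s\n", $2, $1, $1 * $2 * eff / 8
    }'
}

printf '%s\n' 'Speed 16GT/s, Width x8 (downgraded)' \
              'Speed 16GT/s, Width x4 (downgraded)' | pcie_bandwidth
```

For the two cards above this gives about 15.75 GB/s and 7.88 GB/s, compared with about 31.51 GB/s for a full x16 link at the same rate. On a live system, the input could come from lspci -vv filtered to the LnkSta lines.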

I ran the example transcoding in all the modes described above, first on an otherwise idle machine and then simultaneously with CUDA-enabled whisper.cpp running on the graphics cards. For monitoring GPU load, I recommend nvtop.

The total transcoding time can be thought of as the sum of:

Time spent encoding and decoding on the GPU.
Time spent moving data over PCIe.

The ratio of these times varies between video files. For low-resolution videos, encoding/decoding is quick, so the PCIe overhead is proportionally larger. This is noticeable in the results below.

In general, GPU transcoding is significantly faster than CPU transcoding: over 5x for the 4K video, and around 2x for the low-resolution one. On an otherwise idle GPU, the differences between GPU modes are negligible, but they become more pronounced under heavy load. After all, the encoder/decoder is only a small part of the GPU. Overall, it is beneficial to use the full hwaccel mode, as it minimizes PCIe traffic.

Low-resolution video: MPEG2 to H.264

A 720×576, 25 fps MPEG2 file captured from a DVB-S tuner, transcoded to H.264. The figures are the speedup over the real-time video duration, as reported by FFmpeg, averaged over four passes.
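
A measurement like this can be scripted. The sketch below assumes the figure is read from the speed= field of FFmpeg's final progress line and that the passes are combined with a plain arithmetic mean; last_speed and mean are hypothetical helper names.

```shell
# Extract the last "speed=...x" value from FFmpeg's progress output (stdin).
# Progress updates are separated by carriage returns, hence the tr.
last_speed() {
  tr '\r' '\n' | sed -n 's/.*speed= *\([0-9.][0-9.]*\)x.*/\1/p' | tail -n 1
}

# Arithmetic mean of one number per line, rounded to one decimal place.
mean() {
  awk '{ sum += $1; n++ } END { if (n) printf "%.1f\n", sum / n }'
}

# Example with four passes (needs ffmpeg and a GPU, so it is commented out):
#   for pass in 1 2 3 4; do
#     ffmpeg -y -hide_banner -i input-file.ts -c:v h264_nvenc output.mp4 2>&1 | last_speed
#   done | mean
```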

No:	Mode description:	Command:	Speedup (idle):	Speedup (Whisper):
1	Full hwaccel	`ffmpeg -hwaccel cuda -hwaccel_output_format cuda -c:v mpeg2_cuvid -i input-file.ts -c:v h264_nvenc output.mp4`	45.3	45.2
2	Independent NVDEC→NVENC	`ffmpeg -c:v mpeg2_cuvid -i input-file.ts -c:v h264_nvenc output.mp4`	44.7	18.5
3	Independent NVDEC→NVENC on separate GPUs	`ffmpeg -c:v mpeg2_cuvid -gpu 0 -i input-file.ts -c:v h264_nvenc -gpu 1 output.mp4`	45.0	19.1
4	CPU decoding, encoding with NVENC	`ffmpeg -i input-file.ts -c:v h264_nvenc -gpu 0 output.mp4`	44.2	20.4
5	Default, CPU, libx264 encoding	`ffmpeg -i input-file.ts -c:v libx264 output.mp4`	24.8

The video is low-resolution, so encoding/decoding is fast, and the time spent moving data over PCIe becomes significant. When the data stays in GPU memory (mode 1), the difference between an idle GPU and a GPU under load is negligible. However, in modes 2 to 4, where frames cross PCIe, performance drops by more than half when Whisper is running (for example, from 44.7 to 18.5 in mode 2).

4K video: H.264 to HEVC

bbb_sunflower_2160p_30fps_normal.mp4, a 4K video of the well-known Big Buck Bunny clip, downloaded from here. 3840×2160, 30 fps, H.264 transcoded to HEVC.

No:	Mode description:	Command:	Speedup (idle):	Speedup (Whisper):
1	Full hwaccel	`ffmpeg -hwaccel cuda -hwaccel_output_format cuda -c:v h264_cuvid -i bbb.mp4 -c:v hevc_nvenc output.mp4`	4.39	4.30
2	Independent NVDEC→NVENC	`ffmpeg -c:v h264_cuvid -i bbb.mp4 -c:v hevc_nvenc output.mp4`	4.37	4.12
3	Independent NVDEC→NVENC on separate GPUs	`ffmpeg -c:v h264_cuvid -gpu 0 -i bbb.mp4 -c:v hevc_nvenc -gpu 1 output.mp4`	4.55	4.44
4	CPU decoding, encoding with NVENC	`ffmpeg -i bbb.mp4 -c:v hevc_nvenc output.mp4`	4.35	4.43
5	Default, CPU, libx265 encoding	`ffmpeg -i bbb.mp4 -c:v libx265 output.mp4`	0.72

The video resolution is higher, so encoding/decoding takes longer. Moving data over PCIe is a much smaller fraction of the overall processing time, so the performance hit is lower when Whisper is running.
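
The comparisons drawn from both tables come down to simple arithmetic, which the following sketch reproduces from the table values. The helper names pct_drop and times_faster are made up for illustration.

```shell
# Percentage drop from an idle figure ($1) to a figure under load ($2).
pct_drop() {
  awk -v idle="$1" -v busy="$2" 'BEGIN { printf "%.0f%%\n", (idle - busy) / idle * 100 }'
}

# How many times faster $1 is than $2.
times_faster() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1fx\n", a / b }'
}

pct_drop 44.7 18.5      # low-resolution video, mode 2, idle vs. Whisper
pct_drop 4.37 4.12      # 4K video, mode 2, idle vs. Whisper
times_faster 45.3 24.8  # low-resolution video, full hwaccel vs. libx264
times_faster 4.39 0.72  # 4K video, full hwaccel vs. libx265
```

This prints 59%, 6%, 1.8x, and 6.1x: the load penalty of mode 2 almost disappears at 4K, while the advantage over CPU encoding grows.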

Limitations

NVDEC and NVENC are implemented in hardware as fixed-function circuits, so they are neither upgradeable nor as versatile as traditional software codecs. Their output quality is widely reported to be lower than that of software codecs. They were designed for real-time use, so speed wins over quality and compression efficiency. With each generation, however, the encoder gets better. Apparently, a major upgrade came with the Turing architecture (starting with GeForce RTX 2080).

Fortunately, this affects only encoding. Decoding is straightforward: given a stream in a specific format, there is only one way to render it into raw pixels. Encoding is different: there are many ways to compress raw pixels into a video stream, and CPU-based encoders are generally better at it.

With NVENC, only three output formats are supported, with a limited set of modes and options. Video resolution is limited as well; you may run into the following error: Video width XX not within range from 48 to 4096. The maximum resolution is 4096×4096, but newer cards raise this limit to 8192×8192 for some formats.
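
Such a failure can be caught before encoding starts. The check below is a sketch that encodes only what is stated above: a minimum width of 48 and a maximum of 4096 in each dimension, with 8192 as an optional override for newer cards. The function name is hypothetical, and any other limits (such as a minimum height) are deliberately left out because they are not covered here.

```shell
# Usage: within_nvenc_limits WIDTH HEIGHT [MAX]
# MAX defaults to 4096; pass 8192 for cards and formats that allow it.
within_nvenc_limits() (
  width=$1 height=$2 max=${3:-4096}
  [ "$width" -ge 48 ] && [ "$width" -le "$max" ] && [ "$height" -le "$max" ]
)

# The dimensions of a real file can be read with ffprobe, for example:
#   ffprobe -v error -select_streams v:0 -show_entries stream=width,height \
#     -of csv=p=0 input.mp4
within_nvenc_limits 3840 2160 && echo "3840x2160: ok"
within_nvenc_limits 7680 4320 || echo "7680x4320: too large for a 4096 limit"
within_nvenc_limits 7680 4320 8192 && echo "7680x4320: ok with an 8192 limit"
```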

Last but not least, consumer-grade NVIDIA cards (GeForce) have a limit on the number of simultaneous NVENC encoding sessions. Fortunately, this limit can be easily circumvented.

Removing the limit on the maximum number of NVENC sessions

To extract a bit more money, NVIDIA restricts the number of simultaneous NVENC encoding sessions on consumer-grade GPUs. Until about six years ago, this limit was as low as two; since then, it has increased to 3, 5, 8, and currently 12 (according to the support matrix). This limit can be easily removed altogether, however, by changing a few bytes in the driver’s libnvidia-encode.so.* libraries. The open-source community maintains nvidia-patch, a script that applies this change automatically.

Note

An interesting article shows the reverse-engineering process that led to the creation of the patch.

Calling ./patch.sh should result in something like this:

Detected nvidia driver version: 550.163.01
libnvidia-encode.so
Attention! Backup not found. Copying current libnvidia-encode.so to backup.
7786f14e4baa4d93b9bfcbf3d90645c650e1414b  /opt/nvidia/libnvidia-encode-backup/libnvidia-encode.so.550.163.01
6e59ed85eb08ba3629cf7f675a1507650558fc78  /usr/lib/x86_64-linux-gnu/nvidia/current//libnvidia-encode.so.550.163.01
Patched!

After applying the patch, perform a check by running multiple instances of FFmpeg simultaneously. The script below does this with fourteen processes:

for n in $(seq 1 14); do
  # -nostdin keeps background jobs from reading the terminal; -y overwrites outputs of earlier runs
  ffmpeg -nostdin -y -hwaccel cuda -hwaccel_output_format cuda -c:v mpeg2_cuvid -i input-file.ts -c:v h264_nvenc "output-${n}.mp4" &
done
wait

nvidia-smi should show all fourteen processes running:

+------------------------------------------------------------------+
| Processes:                                                       |
|  GPU   GI   CI        PID   Type   Process name       GPU Memory |
|        ID   ID                                        Usage      |
|==================================================================|
|    0   N/A  N/A    646221      C   ffmpeg                 423MiB |
|    0   N/A  N/A    646222      C   ffmpeg                 425MiB |
|    0   N/A  N/A    646223      C   ffmpeg                 423MiB |
|    0   N/A  N/A    646224      C   ffmpeg                 425MiB |
|    0   N/A  N/A    646225      C   ffmpeg                 425MiB |
|    0   N/A  N/A    646226      C   ffmpeg                 425MiB |
|    0   N/A  N/A    646227      C   ffmpeg                 425MiB |
|    0   N/A  N/A    646228      C   ffmpeg                 425MiB |
|    0   N/A  N/A    646229      C   ffmpeg                 425MiB |
|    0   N/A  N/A    646230      C   ffmpeg                 441MiB |
|    0   N/A  N/A    646231      C   ffmpeg                 437MiB |
|    0   N/A  N/A    646232      C   ffmpeg                 423MiB |
|    0   N/A  N/A    646233      C   ffmpeg                 423MiB |
|    0   N/A  N/A    646234      C   ffmpeg                 433MiB |
+------------------------------------------------------------------+
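
Counting the rows by eye gets tedious, so the check can be automated. The following sketch counts the ffmpeg rows in a process table read from stdin; count_ffmpeg_procs is a hypothetical name, and the parsing assumes the table layout shown above.

```shell
# Count process-table rows whose process name is ffmpeg (bare or as a path).
# Fields of a row: "|", GPU, GI, CI, PID, Type, process name, memory, "|".
count_ffmpeg_procs() {
  awk '$1 == "|" && $7 ~ /(^|\/)ffmpeg$/ { n++ } END { print n + 0 }'
}

# Example (needs the NVIDIA driver, so it is left commented out):
#   nvidia-smi | count_ffmpeg_procs
```

With the patch applied and the loop above still running, the expected result is 14.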

FFmpeg compilation

The standard Debian ffmpeg package supports NVIDIA hardware acceleration out of the box, provided the proprietary drivers are installed. However, you may sometimes need to compile FFmpeg yourself, for example to get additional features or bug fixes. The canonical (but slightly dated) tutorial for compiling on Debian is on the FFmpeg website. NVIDIA also provides its own guide for compiling specifically with NVENC/NVDEC support. On my test machine, the CUDA version is 12.4, and the driver version is 550.163.01. I chose the latest stable FFmpeg release that works with CUDA 12.4: 7.1.3.

I assume the NVIDIA drivers and CUDA are configured correctly. First, you need to install at least the base dependency packages. You may also need additional packages, depending on which FFmpeg features you want. For example, I included libsmbclient-dev because I’m compiling with Samba support:

sudo apt-get install build-essential yasm cmake libtool libc6 libc6-dev unzip wget libnuma1 libnuma-dev nasm gcc git libass-dev libsmbclient-dev

mkdir ffmpeg-nvidia
cd ffmpeg-nvidia

Clone the nv-codec-headers repository. These headers are required to interface with NVIDIA’s codec API. Debian tends to lag behind on CUDA, so I checked out an older release (12.2.72.0) rather than the latest one:

git clone https://github.com/FFmpeg/nv-codec-headers.git
cd nv-codec-headers
git checkout n12.2.72.0

cd ..

Clone the FFmpeg repository and check out a sufficiently recent stable release. To work with nv-codec-headers version 12.2, I checked out FFmpeg 7.1.3:

git clone https://git.ffmpeg.org/ffmpeg.git
cd ffmpeg
git checkout n7.1.3

Now run ./configure. The options are fairly standard: I enabled a number of codecs, as well as Samba support. Note the --extra-cflags option, which points to the nv-codec-headers directory:

./configure \
--extra-cflags=-I../nv-codec-headers/include \
--disable-debug \
--enable-gpl \
--enable-libass \
--enable-libfreetype \
--enable-libmp3lame \
--enable-libnpp \
--enable-libopus \
--enable-libsmbclient \
--enable-libtheora \
--enable-libvorbis \
--enable-libx264 \
--enable-nonfree \
--enable-vdpau \
--enable-version3 \
--enable-ffnvcodec \
--enable-cuda \
--enable-cuda-nvcc \
--enable-cuvid \
--enable-nvenc
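
Most of the --enable-lib... flags need development headers, and configure stops with an error when one is missing. As a hedged aid, the sketch below maps flags to the Debian packages that usually provide those headers. The mapping and the function name are assumptions, not taken from any official list, so verify the names with apt-cache policy before installing. --enable-libnpp is left out because it depends on the CUDA toolkit rather than on a single -dev package.

```shell
# Print the Debian -dev package assumed to satisfy a configure flag.
# Prints nothing for flags without an entry in this (assumed) mapping.
dev_package_for() {
  case "$1" in
    --enable-libass)       echo libass-dev ;;
    --enable-libfreetype)  echo libfreetype-dev ;;
    --enable-libmp3lame)   echo libmp3lame-dev ;;
    --enable-libopus)      echo libopus-dev ;;
    --enable-libsmbclient) echo libsmbclient-dev ;;
    --enable-libtheora)    echo libtheora-dev ;;
    --enable-libvorbis)    echo libvorbis-dev ;;
    --enable-libx264)      echo libx264-dev ;;
    --enable-vdpau)        echo libvdpau-dev ;;
  esac
}

# List the packages for a few of the flags used above.
for flag in --enable-libmp3lame --enable-libopus --enable-libx264 --enable-nvenc; do
  dev_package_for "$flag"
done
```

The resulting list can be passed to sudo apt-get install.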

If everything went well, you should see a report listing the enabled features. Now, compile everything:

make -j 8
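
The job count of 8 is arbitrary; it can also follow the number of available CPU threads. A small sketch, assuming nproc from coreutils is present (with getconf as a fallback); cpu_count is a hypothetical helper name, and the make command is only printed here.

```shell
# Number of processing units available to this shell.
cpu_count() {
  nproc 2>/dev/null || getconf _NPROCESSORS_ONLN
}

# Show the make invocation that would use every available thread.
echo "make -j$(cpu_count)"
```

On the 12-core, 24-thread Ryzen 9 3900X used here, this would typically suggest make -j24.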

After a while, the compilation should produce two binaries: ffmpeg and ffprobe.