Wednesday, June 7, 2023

Frames per Joule.

We have fallen for the folly of maximum GPU performance. It is the age of 450 Watt GPUs. Why is that?

There is absolutely no need for 450 Watt GPUs. The sole reason they draw that much is a few additional FPS in the benchmarks. For real-world use, those 450 Watts are useless, as I have come to conclude after running an experiment.

Traditionally, I have always used NVIDIA GPUs in my Linux boxes, for stability and performance. But today, I am running an (elderly) AMD RX580 in my main rig. NVIDIA on Linux does not give you much control over clock speeds and voltages, but CoreCtrl lets me adjust the clocks of the RX580.

I already knew that the last few percent of performance requires a heap more wattage. This is because power draw increases linearly with clock speed, but quadratically with voltage. Increasing the voltage comes at a huge cost. Let's quantify that cost.

I benchmarked my Photon Mapping engine at two settings: 900 MHz GPU / 1000 MHz VRAM versus 1360 MHz GPU / 1750 MHz VRAM.

I then checked the FPS and the (reported!) wattage. I guess for a more complete test, measuring power consumption at the wall socket would have been better. Maybe another time.

So, with the higher clocks, we of course expect better performance. Did we get it?

  Clocks (GPU/VRAM MHz)    FPS    Voltage (V)    Power (W)
               900/1000     38          0.850           47
              1360/1750     46          1.150          115

The result was even more extreme than I expected: For 145% more power consumption, I got to enjoy 21% more frames/s.
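As a quick sanity check, the linear-in-clock, quadratic-in-voltage model from above roughly predicts that jump:

    predicted power ratio:  (1360/900) * (1.150/0.850)^2  ≈  1.51 * 1.83  ≈  2.8x
    measured power ratio:   115 W / 47 W                  ≈  2.4x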

Not only that: the high-clock setting came with deafening fan noise, whereas I did not hear a thing at the low-voltage setting.

Which now makes me ask the question: shouldn't those benchmarking YouTube channels do their tests differently? Why are we not measuring --deep breath-- frames per second per (joule per second) instead? (As a watt is defined as a joule per second.)

We can of course simplify those units as frames per joule, as the time units cancel each other out.
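Applied to the table above (and trusting the reported wattages), that works out to:

    low clocks:   38 frames/s /  47 W   ≈  0.81 frames per joule
    high clocks:  46 frames/s / 115 W   ≈  0.40 frames per joule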

Hot take: LTT, JayzTwoCents, and GamersNexus should all be benchmarking GPUs at a lower voltage/clock setting. I do not care if an RTX edges out a Radeon by 4% more FPS at 450 watts. I want to know how they perform with a silent fan and a lowered voltage. Consumer Reports tests a car at highway speeds, not on a race track with nitrous oxide and a bolted-on supercharger. We should test GPUs more reasonably.

Can we please have test reports with Frames per Joule? Where is my FPJ at?

UPDATE: I can't believe I missed the stage-4 galaxy brain. We can invert that unit to Joules/Frame instead!

Friday, June 2, 2023

The lacking state of rPi4 GPU support.

Because I am a sucker for Linux SBCs, I got myself a Sipeed Lichee Pi 4A RISC-V SBC. And remarkably, it can do OpenCL too! Which means I can try my Photon-Mapping CL kernel on it.

Even though it can run OpenCL kernels, I did find that the OpenCL implementation is not the most stable: as soon as I try to fill a buffer with clEnqueueFillBuffer(), it crashes.
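A possible workaround is to skip clEnqueueFillBuffer() altogether and upload a pre-filled host buffer instead. A minimal sketch, assuming a command queue `queue`, a buffer `buf` of `size` bytes, a zero fill pattern, and the usual <CL/cl.h>, stdlib and string includes:

    // Fill a staging buffer on the host, then copy it over instead of clEnqueueFillBuffer().
    unsigned char *staging = malloc(size);
    memset(staging, 0, size);                      // the fill pattern, here all zeroes
    clEnqueueWriteBuffer(queue, buf, CL_TRUE,      // CL_TRUE: block until the copy is done
                         0, size, staging, 0, NULL, NULL);
    free(staging);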

The performance of my kernel on this SBC was incredibly bad though: three orders of magnitude slower than a desktop GPU. Which made me wonder what the performance would be on my Raspberry Pi 400.

Eben Upton designed the VideoCore IV GPU in the original rPi. So surely, the software support should be excellent. Sadly, there is no OpenCL for the rPi4.

CLVK to the rescue! This software translates OpenCL kernels to Vulkan compute kernels. Even though Raspberry Pi OS would not work with it, Ubuntu 23.04 for the rPi would, as it has a newer Mesa.

Which brings us to the next disappointment: even though there is a Vulkan driver for the rPi4, this driver lacks important Vulkan extensions. It cannot handle 16-bit floats, nor can it handle 8-bit ints.

  unknown NIR ALU inst: vec1 16 div ssa_217 = b2f16! ssa_182

This leaves us with a rather lacking implementation. Since I wrote my Photon Mapper for FP16, this experiment has to be shelved.
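For next time, a minimal host-side check (a sketch, assuming a valid cl_device_id `device` and the usual <CL/cl.h>, string and stdio includes) to find out up front whether fp16 is on offer:

    // Query the device's extension string and look for cl_khr_fp16.
    char extensions[4096];
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(extensions), extensions, NULL);
    if (strstr(extensions, "cl_khr_fp16") == NULL)
        printf("no fp16 support on this device\n");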

Wednesday, April 26, 2023

Practical differences: CUDA / OpenCL

Once again, I find myself porting CUDA code to OpenCL. And I had to remind myself of the process, and of the differences between the two. Let's document them, so future Bram can get up to speed quickly.

The major difference is of course compatibility. OpenCL is supported by NVIDIA, AMD, Intel, etc. CUDA is only supported by NVIDIA. It's the whole reason for this porting exercise.

The CUDA kernel can be pre-compiled by the developer using nvcc, and then be shipped as a PTX file. The OpenCL kernel typically ships as C-like source.
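Host-side, the loading differs accordingly. A minimal sketch, assuming an existing `context`, `device`, and a `kernel_source` string; the file names are made up:

    // CUDA driver API: load the pre-built PTX that nvcc produced (e.g. nvcc -ptx kernel.cu).
    CUmodule module;
    cuModuleLoad(&module, "kernel.ptx");

    // OpenCL: hand over the C-like source and let the driver compile it at run time.
    cl_int err;
    cl_program prog = clCreateProgramWithSource(context, 1, &kernel_source, NULL, &err);
    clBuildProgram(prog, 1, &device, "", NULL, NULL);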

The function definition of a CUDA kernel gets prefixed with __global__, whereas an OpenCL kernel uses a __kernel prefix.

A CUDA kernel uses blockIdx, blockDim and threadIdx to determine which sub-part of the input data it is supposed to process. An OpenCL kernel uses get_global_id() instead.
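To make that concrete, here is the same trivial kernel written both ways (a sketch; the kernel name and arguments are made up):

    // CUDA
    __global__ void scale(const float *src, float *dst, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index from block/thread ids
        if (i < n) dst[i] = 2.0f * src[i];
    }

    // OpenCL
    __kernel void scale(__global const float *src, __global float *dst, int n)
    {
        int i = get_global_id(0);                        // global index directly
        if (i < n) dst[i] = 2.0f * src[i];
    }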

Using 16-bit floating point arithmetic is as easy as #include <cuda_fp16.h> and using __half. In OpenCL, you need to check for an extension and add #pragma OPENCL EXTENSION cl_khr_fp16 : enable

In CUDA, constant input data is prefixed with __constant__, which the host then sets with the cuMemcpyHtoDAsync() function after getting the symbol using the cuModuleGetGlobal() function. In OpenCL, if the constant input data is large, you prefix it with __global and the host sets it with the clCreateBuffer() and clEnqueueWriteBuffer() functions.
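Host-side, that looks roughly like this (a sketch; `module`, `stream`, `context`, `queue`, `host_data` and `size` are assumed to exist, and the symbol name is made up):

    // CUDA driver API: look up the __constant__ symbol and copy into it.
    CUdeviceptr dptr;
    size_t bytes;
    cuModuleGetGlobal(&dptr, &bytes, module, "params");
    cuMemcpyHtoDAsync(dptr, host_data, bytes, stream);

    // OpenCL: create a read-only buffer, fill it, and later pass it with clSetKernelArg().
    cl_int err;
    cl_mem buf = clCreateBuffer(context, CL_MEM_READ_ONLY, size, NULL, &err);
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, size, host_data, 0, NULL, NULL);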

Half-float literals need a constructor in CUDA: __half(1.0). In OpenCL, you can use a suffix: 1.0h.

Many operators need a function call in CUDA. The reciprocal in CUDA is hrcp(x), and a simple < on a half float becomes __hlt(a,b), which makes it less legible.
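Side by side, that looks something like this (a sketch; the function is made up, and the OpenCL variant assumes cl_khr_fp16 is available):

    // CUDA: half math goes through constructors and intrinsics.
    #include <cuda_fp16.h>
    __device__ __half recip_if_small(__half x)
    {
        if (__hlt(x, __half(1.0f)))     // x < 1.0
            return hrcp(x);             // 1.0 / x
        return x;
    }

    // OpenCL: plain operators and the 'h' literal suffix.
    #pragma OPENCL EXTENSION cl_khr_fp16 : enable
    half recip_if_small(half x)
    {
        if (x < 1.0h)
            return 1.0h / x;
        return x;
    }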

A minor difference is that CUDA's atomicAdd is atomic_add in OpenCL.

CUDA has asserts but OpenCL does not.

Tuesday, April 4, 2023

Minecraft Server

When it comes to video games, I don't think Minecraft has been surpassed. Like all the best games, it has a simple premise: mine blocks in an infinite world, and use them to craft other blocks. But from this simple foundation, a masterpiece arose.

Even after the purchase by Microsoft, the game is still available for Linux, as Mojang maintains a Java version of their game. Lately, I have been looking into running a server, and I intend to write custom server-side code, to see what can be done with that. Minecraft lets you modify the client code, but this post investigates modification of the Minecraft server.

Minecraft modding has a long history. But it appears that my best entry point would be SpigotMC.

After executing BuildTools.jar, I ended up with a file called spigot-1.19.3.jar, which I copied to my Minecraft server root directory. Instead of running minecraft_server.1.19.3.jar, I now run the spigot jar file.
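Starting it looks much like starting the vanilla server; something along these lines (the memory flag is optional):

    $ java -Xmx2G -jar spigot-1.19.3.jar nogui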

To make sure newer Minecraft clients can connect, you can put ViaVersion.jar in the plugins directory.

Next up is finding out how to write custom server code. So far I have found Kody Simpson's YouTube series an excellent resource.

Bukkit is a Java interface, and it is implemented by CraftBukkit.

CraftBukkit is implemented using net.minecraft.server code.

This means that the same object can be referenced and used at three different levels. For instance, the player exists as Bukkit's Player, as CraftBukkit's CraftPlayer and as Mojang's EntityPlayer.

Only the Bukkit interface remains stable between releases, and can be used across different Minecraft server versions. Also, the NMS (net.minecraft.server from Mojang) code is decompiled and obfuscated.

Does coding an NPC require NMS? Or can it be done in SpigotMC? If not, what about PaperMC?

To deal with the obfuscation of n.m.s., there are the so-called Mojang mappings. Initially, I was unable to use those with Java 19. But using a development version of SpecialSource, I was able to use n.m.s. with Java 19.

When getting confused in API terminology, this is a good reminder.

I noticed that using the remapped API really slowed down my iteration speed. Building my plugin went from a few seconds to 1 minute, because the remapper is slow.

The first results from my experiments: The Little Vex That Could. A construction-buddy that builds castles from IKEA flat-packs.

Monday, March 27, 2023

Espressif IoT Development Framework

So far, I have been programming my µcontrollers using the Arduino IDE. As I progressed with my TurboLEDz experiments, I came across different µcontrollers, including the ESP32 series from Espressif.

Espressif actually has its own SDK, which targets the command line. As I feel quite at home on the command line, I decided to switch over to the Espressif SDK, which is called ESP IDF, or ESP IoT Development Framework.

This environment is based on CMake, plus Python tools that wrap it. Nowadays, Espressif recommends using it with the VS Code IDE, but frankly, I found it to work better without.

You mainly interface with the build environment using the idf.py tool. The typical sequence would be:

    $ idf.py set-target esp32s2
    $ idf.py build
    $ idf.py flash
    $ idf.py monitor
  

And what I really appreciate about ESP IDF is that you can code in pure C, without any C++ stuff involved. No need for .ino files, either. Plain, simple C. I love it.
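A complete ESP IDF program can be as small as this (a minimal sketch, not tied to any particular board):

    #include <stdio.h>
    #include "freertos/FreeRTOS.h"
    #include "freertos/task.h"

    // ESP IDF calls app_main() once the runtime is up; no .ino, no setup()/loop().
    void app_main(void)
    {
        for (;;)
        {
            printf("hello from plain C\n");
            vTaskDelay(pdMS_TO_TICKS(1000));    // sleep one second
        }
    }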

Speaking of Internet of Things... I've always found it to be a nebulous term. But I guess it pertains to devices that don't have a screen, keyboard, mouse, but do have Ethernet/WiFi? And thanks to the march of progress, those tiny little devices are typically as capable as desktop computers were two decades ago?

My favourite µcontrollers are RISC-V based. And some devices really blur the line between µcontroller and single-board computer. Like Pine64's Ox64, which is powerful enough to run Linux (but without protected memory, I think).

I also like the dual-core ESP32 a lot. It enabled me to create this voice-changer where the two cores tag-team: one core computes the Fourier transforms while the other does I2S, and then they switch. So one core does everything for the odd-numbered batches of samples, and the other does everything for the even-numbered batches. Nice low-latency concurrency. Yay! But that project was all in Arduino IDE.
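For the record, this is not the original Arduino code, just a sketch of how that core-pinning would look in ESP IDF (task names, stack sizes and priorities are arbitrary):

    #include <stdint.h>
    #include "freertos/FreeRTOS.h"
    #include "freertos/task.h"

    // One worker per core; each handles either the even or the odd batches of samples.
    static void audio_worker(void *arg)
    {
        const int parity = (int)(intptr_t)arg;   // 0: even batches, 1: odd batches
        for (;;)
        {
            // ...wait for the next batch with this parity, do I2S in, FFT, I2S out...
            vTaskDelay(1);                       // placeholder for the real (blocking) work
        }
    }

    void app_main(void)
    {
        xTaskCreatePinnedToCore(audio_worker, "worker0", 4096, (void *)0, 5, NULL, 0);
        xTaskCreatePinnedToCore(audio_worker, "worker1", 4096, (void *)1, 5, NULL, 1);
    }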

For my first IDF project, I wrote a single-header-file piece of code for use with Adafruit's LED backpack. See my ledbp.h gist. It sits on top of Espressif's driver/i2c.h code.

Friday, December 16, 2022

The $8 Linux computer (part 2.)

Caveat emptor! At the time of writing, Ethernet, WiFi and the SD card are not working. It's good to realize this before jumping in.

Things have improved a little since writing PART 1 of my experience. Instead of a patched bflb-mcu-tool to flash, it is now possible to use a modified picoprobe [DOWNLOAD HERE] with the Dev Cube flashing tool.

Also, there is a patch for the pin definitions so that we can capture the UART of the D0 core (which is a T-HEAD C906) on pins GPIO17/GPIO16.

With the fixed pin definitions, we can hook up a USB-to-UART dongle, like the CH340G, to the second UART as shown below.

In this picture, the red dongle captures the output from the Linux console running on the T-HEAD C906 core. The pink dongle is for programming the Ox64. Both of these need to be connected to your PC. The Ox64 itself needs no connection to the PC, other than power fed into the Micro-USB port.

So, once we have wired this all up and flashed the binaries, what do we get?

We get root!

And that's after booting in a blistering 0.22 seconds!

Full boot log.

I love Linux on a diet. So mean and lean.

Next up: Can we get networking?

Thursday, December 15, 2022

The $8 Linux computer (part 1.)

Yes, you read that right... for $8 you can run Linux on a RISC-V computer. And not just a stripped-down kernel without an MMU. No, the Ox64 comes with capable RISC-V cores. Heck, not just one core, but three cores (all different). Exciting times!

Caveat emptor! At the time of writing, Ethernet, WiFi and the SD card are not working. It's good to realize this before jumping in.

It looks like a microcontroller, but behaves like a UNIX machine. Sweet, or what? Getting it up and running, though, is far from trivial. There are so many hurdles to clear!

Hurdle 1: You cannot program it using USB, like you would an Arduino, ESP32, Stamp-C3 or rPi Pico. Nope... this needs to be programmed via the UART TX/RX pins. In the picture above, you can see them hooked up to GPIO14 / GPIO15 and GND.

Hurdle 2: The flashing tool Dev Cube will crash on launch when using Ubuntu on Wayland: you have to switch to Xorg instead.

$ ./BLDevCube-ubuntu 
Warning: Ignoring XDG_SESSION_TYPE=wayland on Gnome. Use QT_QPA_PLATFORM=wayland to run on Wayland anyway.
/usr/lib/x86_64-linux-gnu/gio/modules/libdconfsettings.so: undefined symbol: g_log_structured_standard
Failed to load module: /usr/lib/x86_64-linux-gnu/gio/modules/libdconfsettings.so
GLib-GIO-Message: Using the 'memory' GSettings backend.  Your settings will not be saved or shared with other applications.

(BLDevCube-ubuntu:54646): GLib-GIO-ERROR **: Settings schema 'org.gnome.settings-daemon.plugins.xsettings' does not contain a key named 'antialiasing'
Trace/breakpoint trap (core dumped)
  

So, UART programming then... fine, we'll just use one of those little CH340G dongles. I got one off Amazon, but it would not work... the handshake in Dev Cube failed every time (hurdle 3).

Luckily, some people found out that you can program it with an RP2040 device, like the rPi Pico. I had a KB2040 from Adafruit lying around, so I used that.

Which puts us at hurdle 4: How to turn a KB2040 into a serial bridge between PC and Ox64? Hook up GPIO-4 of the RP2040 to GPIO-15 of the Ox64. Hook up GPIO-5 of the RP2040 to GPIO-14 of the Ox64. Also, connect the grounds, of course. See below how to hook up for flashing the M0.

Then we need to load our RP2040 with picoprobe, which we can do by holding BOOT while powering up, and feeding it the picoprobe uf2 file.

Ok, so now we can flash it then? Unfortunately, no, hurdle 5: DevCube is still not able to flash the Ox64, but we can use bflb-mcu-tool instead.

Ok, but with bflb-mcu-tool, we are all set! Are we? Introducing hurdle 6: we need to modify the tool before it will work for us. This means stripping the DTR and RTS handling from the tool by editing the ~/.local/lib/python3.10/site-packages/bflb_mcu_tool/libs/bflb_interface_uart.py file. Here is my copy.

So with the Ox64 powered via Micro-USB, and the RP2040 connected to my PC using USB-C, I can finally flash something. For that, I took the low_load_bl808_m0.bin from the bl808 Linux guide and flashed it with:

$ bflb-mcu-tool --chipname=bl808 --port=/dev/ttyACM2 --baudrate=2000000 --firmware=out/low_load_bl808_m0.bin
    

After which I now get output from my Ox64, via my KB2040 to my PC:

dynamic memory init success,heap size = 156 Kbyte 
E907 start...
mtimer clk:1000000
psram clk init ok!
    

Ok, a modest start, but the Ox64 is talking to my PC now! Yay! Come back to my blog at a later time to see if I will manage to run Linux on the Ox64! Because as I understand it, I also need to flash something to the D0 and then flash the actual OS as well.

UPDATE: Here is PART 2.