Tuesday, October 20, 2020

Exit vi less often.

I am from the school of thought that recognizes that typing is not the bottleneck in software engineering. So I am perfectly fine using my command line, my vi and my Makefile. My peers consider me a weirdo for not using an IDE. Additionally, a powerful debugger is not a substitute for thought either, as expertly expressed by Rob Pike.

Not using an IDE means more direct control over what happens. No schemas, no vcproj xml, no buried GUI fields, no property sheets. Or as I call it: bliss!

But fair enough, I spend a non-trivial amount of time opening and closing vi, because I need to navigate my code. What could we do to reduce that?

Remember your place

First off, it helps if vim (I'll use vim and vi interchangeably) remembers where your cursor was the last time you edited a file. For this, we have the following snippet in the .vim/vimrc configuration file:

if has("autocmd")
  au BufReadPost * if line("'\"") > 0 && line("'\"") <= line("$") | exe "normal! g`\"" | endif
endif

Toggle to/from header file

Next up: we want to be able to quickly toggle between header file and implementation file. Because I mix C and C++, I unfortunately have three instead of two shortcuts for that. \oh to open the header, \oc to open the .c file, and \oC to open the .cpp file. You can have those too, with this in the configuration:

nnoremap <leader>oc :e %<.c<CR>
nnoremap <leader>oC :e %<.cpp<CR>
nnoremap <leader>oh :e %<.h<CR>

:set timeoutlen=2000

Note: The timeout is raised so that you have some time between pressing the leader key (backslash) and typing the rest of the command.

Jump to declaration

But how do I quickly jump to the declaration of a particular function? Well, there is a mechanism for that too, called include jump. The vim editor can parse your include statements, so that with the :ij foo command, you can open up the header that declares the foo() function. But you do need to point vim to the path of files to try, which I do with:

set path=.,**

Jump to another file, regardless of location

If you just want to jump to a named file, but don't want to be typing full paths, you can use the :find bar.cpp syntax.

So there we have it: less quitting and starting vim. For completeness, my entire .vim/vimrc configuration is below:

filetype plugin on

set path=.,**

if has("autocmd")
  au BufReadPost * if line("'\"") > 0 && line("'\"") <= line("$") | exe "normal! g`\"" | endif
endif

set wildignore=*.d,*.o

nnoremap <leader>oc :e %<.c<CR>
nnoremap <leader>oC :e %<.cpp<CR>
nnoremap <leader>oh :e %<.h<CR>

:set timeoutlen=2000

autocmd Filetype python setlocal noexpandtab tabstop=8 shiftwidth=8 softtabstop=8

Tuesday, September 29, 2020

7 Years

Amazingly, the PC that I built has lasted me a whole 7 years. It is still a fine PC. It had one or two graphics card upgrades, and some extra RAM, but other than that, a Haswell CPU is still a perfectly fine workhorse.

I would have continued using that PC for even longer, if it were not for two things: One: I got curious about AVX-512 and what I could do with that. Two: my daughter showed interest in PC tech, so I thought this would be a good educational opportunity, so I let her help me build it.

Annoyingly, many, if not most, of Intel's new processors still do not come with AVX-512. They keep introducing CPUs that don't have it. Why? However, there are some interesting routes to take, if you want AVX-512. One of them is to get this obscure little appliance: Intel Crimson Canyon NUC. But going back to just 2 cores? Eh....

Another interesting route to AVX-512 is to buy what corporate and pro users no longer want: old Xeons. Take, e.g., a professional Mac user that bought the 2017 iMac Pro with 8 cores, starting at the $4999,- price. Some time later, this pro-user needs to have a faster Mac, so they upgrade the Xeon CPU. What happens to that old W-2140B Xeon that was replaced? It gets dumped on eBay for $200 or so! Similarly, corporate users dump high end Supermicro Xeon boards there too.

So why not snap those up, and build our own Xeon based workstation? That was the plan, and that was what happened. I am now the proud owner of a used PC, suited to replace my aging Haswell.

Some things I learned along the way:
  • Some fans (looking at you, Noctua) do not reliably report the RPM, so the motherboard can sometimes read a '0' value, causing it to panic, and put all fans on full blast. So out with the Noctua, in with an aging repurposed Corsair fan, yanked from a broken watercooling kit.
  • I bought a really cheap audio card from Amazon, thinking it would be fine. But the System Event Log (SEL) of the Supermicro board actually showed parity errors on the PCI Express bus. These went away after yanking out the sound card.
  • Initially this system ran with an antique GTX750Ti. As a developer I need to test with a variety of GPUs to improve my code, so I thought I should replace it with a Radeon. The quality of the AMD GPU PRO drivers being what it is: the resulting stress is just not worth it. That GPU went back to Amazon.
  • The InWin 301 MicroATX case has two fan-mounts on the front of the case, yet a sealed front with no air inlet. So you end up using the bottom fan mounts, blocking a PCI Express slot with it. It doesn't seem to be a smart design.
  • Even though the CPU and Motherboard can be purchased at discounted prices on eBay, the ECC RAM that is needed is still full price.
  • I designed the system for low power usage. So a PSU rated at 550W and 80+ GOLD seemed good enough. In hindsight, more headroom would be better: I read that PSUs perform best at their 50% level or so.
  • It is quite interesting to be able to manage your system on a side-channel. With IPMI you can manage many aspects of your system.
  • To set sensor limits, use ipmitool sensor thresh FAN1 lower 100 200 300 and ipmitool sensor thresh FAN1 upper 3000 4000 5000.

Wednesday, September 23, 2020

Solving Collisions

I've been developing games for a long time now, but solving collisions remains a hairy subject. The Physics Engine I have used the most is OpenDE. It worked well for Little Crane. But the fact that OpenDE considers a triangle mesh as a generic 'triangle soup' as opposed to a closed surface, tends to get me into trouble. Consider the figure, below, where a wheel intersects the terrain.

Here, we expect triangles A and B to cause the wheel to be pushed out of the terrain. And we expect this correction in the direction of the triangle normals for A and B. Unfortunately, this is not always what happens. As OpenDE considers every triangle individually, it will report collisions on internal edges of the mesh, and try to correct them with collision normals that lie in the triangle plane, as depicted below.

To make matters worse, other edges can cause collision constraints that push in the completely opposite direction, causing the wheel to get stuck in the terrain.

So what are the options here? One approach is to ignore all collision contacts with collision normals that do not align with the triangle normals. This can cause sinking into the terrain.
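The first approach can be sketched in a few lines of C. This is only an illustration, not OpenDE code: the vec3 and contact types, the filter_contacts function and the min_cos threshold are all made up for this example, and a real engine hands you its own contact structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal types; a real physics engine has its own. */
typedef struct { float x, y, z; } vec3;

typedef struct {
    vec3 normal;      /* unit collision normal reported by the engine */
    vec3 tri_normal;  /* unit normal of the triangle that was hit     */
} contact;

static float dot3(vec3 a, vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Compacts the array in place, keeping only the contacts whose collision
 * normal aligns with the triangle normal: dot product >= min_cos.
 * Returns the number of contacts kept. */
size_t filter_contacts(contact *c, size_t n, float min_cos)
{
    assert(c != NULL || n == 0);
    size_t kept = 0;
    for (size_t i = 0; i < n; ++i)
        if (dot3(c[i].normal, c[i].tri_normal) >= min_cos)
            c[kept++] = c[i];
    return kept;
}
```

With min_cos around 0.9, contacts that push sideways along an internal edge get dropped, and only the ones pushing roughly along the surface normal survive. The threshold is a tuning knob: set it too strict, and you throw away so many contacts that the sinking gets worse.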

Another approach is to correct all reported collision normals, and override them with the triangle normals. This seems to work reasonably well, until you hit a 'thin' part of the terrain, where the wheel goes inside the terrain, and emerges from the other side, out of another triangle, as depicted below.

Blindly using the triangle normals as collision normals leads to bad things here: The wheel is simultaneously restricted to go only up and to go only down, meaning the solver will freeze the wheel in place! To solve this, we need to somehow detect one collision as being the first, and ignore the other one.

At the end of the day, filtering the contact points that your physics engine gives you is a non-trivial problem. Limiting the maximum velocities, and using tiny timesteps go a long way, but even then you can get into trouble. If you can, building your terrain out of convex shapes only will save you a lot of trouble, as there is always a well defined inside and outside, making the collision resolution simpler. With generic triangle meshes, you have to be careful.

Saturday, September 19, 2020

OpenCL on a Radeon

So, the game I am currently developing was written around Procedural Generation of terrain, using the CUDA language. CUDA is great for doing GPGPU (General Purpose computing on a Graphics Processor.) It's well thought-out, and straightforward to use. It only has one drawback: it will only run on nVidia GPUs.

If I am to sell my game on Steam, it will have to run on AMD GPUs as well. So that means supporting something besides CUDA. The most portable way of doing GPGPU is GLSL, but that is very cumbersome, as you need textures to get your data out, for starters. The next most portable way would be OpenCL.

At the time of porting from CUDA to OpenCL, I did not have an AMD GPU, so I did the OpenCL port using my nVidia GPU. OpenCL is a little rougher around the edges than CUDA, but the port did work fine, and ran at the same speed too. So it was time to test it on AMD hardware. As my freshly built Xeon Workstation reused a very aging GTX 750 Ti, it was upgrade time anyway, so out with the GTX 750 Ti, and in with the Radeon RX 5500 XT.

The last time I used a Radeon, the Linux drivers for it were a mess, and worse, using them left your Ubuntu install in a mess too. In 2020, things are easier, and Ubuntu supports it out-of-the-box with an Open Source driver. However, that Open Source driver has limited capabilities. For starters, it comes without OpenCL, the sole reason why I purchased the Radeon.

So out with the Open Source driver, and in with the proprietary driver. These are the steps I had to take to install OpenCL support for AMD on Ubuntu:

  • Download the proprietary driver from AMD's website.
  • Unpack the archive.
  • The driver comes in two flavours: consumer and pro. You need the pro version.
  • Install as root with: ./amdgpu-pro-install
  • # dpkg -i opencl-amdgpu-pro-comgr_20.30-1109583_amd64.deb
  • # dpkg -i opencl-amdgpu-pro-icd_20.30-1109583_amd64.deb
And now, I can run my OpenCL code:
    OpenCL 2.1 AMD-APP (3143.9) AMD Accelerated Parallel Processing Advanced Micro Devices, Inc. has 1 devices:
    gfx1012 gfx1012 with [11 units] localmem=65536 globalmem=8573157376 dims=3(1024x1024x1024) max workgrp sz 256
I am not sure why it says [11 units] though, as Wikipedia lists the RX 5500 XT as having 22 cores. Hopefully I didn't get scammed with the hardware.

So on Linux, at least, my code now works both on nVidia and AMD, and I can use either CUDA or OpenCL to generate worlds from Open Simplex Noise, as shown below. TODO: Windows port.

Thursday, May 14, 2020

A Random Direction.

A naive way of generating a random direction d(x,y,z) would be to take random values for x,y,z and then normalize the resulting vector d. But this will lead to a skewed distribution where too many samples fall in the "cube corners" directions.

A proper way to generate a random direction d(x,y,z) is to do the same, but add rejection-sampling.

  • If the resulting vector has length < 1, then normalize the vector, and use it.
  • If the resulting vector is longer, then reject it, and try again.

The downsides of the proper method are:

  • It is slower.
  • Freak occurrences of having to retry many times.
  • It has branch instructions in it.

I want the fastest possible, 8x SIMD way of generating random vectors. That means no branching. And that got me thinking about a direction generator that is fast, but less skewed than the naive way. We would tolerate a little bias for the benefit of speed.

An approach I came up with: just generate two sets of coordinates, yielding two candidates. Use the candidate with the shortest length. Picking that candidate can be achieved without any branching, by just using the _mm256_blendv_ps intrinsic.

In pseudo code:

// 8-way vectors for the candidates a and b:
__m256 cand_ax = [random values -1..1]
__m256 cand_ay = [random values -1..1]
__m256 cand_az = [random values -1..1]
__m256 cand_bx = [random values -1..1]
__m256 cand_by = [random values -1..1]
__m256 cand_bz = [random values -1..1]
// get lengths for candidates a and b.
__m256 len_a = length_8way( cand_ax, cand_ay, cand_az );
__m256 len_b = length_8way( cand_bx, cand_by, cand_bz );
// pick the shortest candidate in each lane.
// blendv takes its second operand in the lanes where the mask is set.
__m256 a_is_shorter = _mm256_cmp_ps( len_a, len_b, _CMP_LE_OS );
__m256 cand_x = _mm256_blendv_ps( cand_bx, cand_ax, a_is_shorter );
__m256 cand_y = _mm256_blendv_ps( cand_by, cand_ay, a_is_shorter );
__m256 cand_z = _mm256_blendv_ps( cand_bz, cand_az, a_is_shorter );
__m256 len    = _mm256_blendv_ps( len_b, len_a, a_is_shorter );
// normalize (note: rcp is only an approximate reciprocal)
__m256 ilen = _mm256_rcp_ps(len);
__m256 x = _mm256_mul_ps( cand_x, ilen );
__m256 y = _mm256_mul_ps( cand_y, ilen );
__m256 z = _mm256_mul_ps( cand_z, ilen );

What is this all good for? You can generate noise fields with random gradients without having to resort to lookup tables. No pesky gathers either! Just create the gradient on-the-fly, each time you sample the field. There is no memory bottleneck, it is all plain computing, without any branching. Added bonus: no wrap-around of the field due to the limited size of a lookup table.

While implementing, I noticed that I forgot something: when sampling the field, I need to construct the random directions at 8 gridpoints, so it will be a lot slower than a table lookup, unfortunately. Oh well.