The Little Engineer That Could: 2018

Wednesday, December 26, 2018

Designing a Programming Language: 10Lang.

I have been programming since 1982, which I guess shows my age. But in those 36 years, I have never designed or implemented a programming language. With the current anti-C++ sentiments on Twitter, I started to ponder what a better language would look like. So let's take the first tiny steps towards eliminating that "design a programming language" bucket-list item.

Let's start with what people jokingly call the hardest part, but which really is the easiest part: the name. As I want my hypothetical language to succeed the prince of all programming languages, C, it only follows that I should name it as a successor to C. Seeing that C++ and D are already taken, I see no other option than to call it 10. Or for better Googlability: 10Lang.

If you look at C, you can kind of see that C code is pretty closely modeled after the hardware. Things in the language often have a direct representation in transistors. And if they don't, it's at least not too remote from the silicon, as more complex and more abstract languages are.

What would a good programming model be for today's hardware? For that, let's just look at the main differences between a 1970s processor and a 2018 processor.

One big difference is that today's CPU has relatively slow memory. And by slow, I mean very slow. The CPU has to wait an eternity for data that is not in a register or a cache. This speed discrepancy means that today's CPU can be crippled by things as simple as branches, virtual function calls and irregular data access. What could we do to lead the programmer's mind away from OOP or AoS (Array-of-Structures) thinking, and naturally guide her into the SoA or Structure-of-Arrays approach?

Design Decision ALPHA
Arrays come first, scalars second.
If we want our code to be efficient, we need to make sure the CPU can efficiently perform batch operations. The one-off operations can be slower; they are not important from a performance viewpoint. By default, operations should be performed on a range, and doing operations on a scalar should be the special case. Maybe by treating it as an array of size 1, maybe by treating it with more verbose syntax, we'll see what works later.
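To make the AoS versus SoA distinction concrete, here is a minimal sketch in plain C. The type and function names (ParticleAoS, ParticlesSoA, step_aos, step_soa) are made up for this illustration and are not part of any 10Lang design.

```c
#include <stddef.h>

/* Array-of-Structures: the fields of one object sit together in memory. */
typedef struct { float x, y, vx, vy; } ParticleAoS;

/* Structure-of-Arrays: each field gets its own contiguous array. */
typedef struct {
    float *x, *y, *vx, *vy;
    size_t count;
} ParticlesSoA;

/* AoS update: consecutive x values are sizeof(ParticleAoS) bytes apart. */
void step_aos(ParticleAoS *p, size_t count, float dt)
{
    for (size_t i = 0; i < count; ++i) {
        p[i].x += p[i].vx * dt;
        p[i].y += p[i].vy * dt;
    }
}

/* SoA update: every loop streams through densely packed floats,
   which is the access pattern that wide SIMD loads and stores want. */
void step_soa(ParticlesSoA *p, float dt)
{
    for (size_t i = 0; i < p->count; ++i) p->x[i] += p->vx[i] * dt;
    for (size_t i = 0; i < p->count; ++i) p->y[i] += p->vy[i] * dt;
}
```

In an "arrays first" language, something like step_soa would presumably be the default way to write the update, and the per-object form would be the verbose exception.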

The next big difference between that 1970s processor and today's are the SIMD units. It's probably one of the most distinctive features of a processor nowadays, and will dictate what the register file will look like. So, if we are going to model the programming language after the transistors, then there really are no two ways about it...

Design Decision BETA
SIMD is a first class citizen in the language.
I haven't figured out how to approach this specifically, yet. First, there is the overlap with design decision ALPHA. SIMD really is kind of like an in-register array. Next, there is the consideration of whether to let the register width seep through into the language. The C programming language was always pretty vague about how many bits there should be in a char, short or int value. I'm not sure if that was helpful in the long run. But there are int8_t, int16_t and int32_t now, of course. Would our 10Lang benefit from explicit SIMD widths in the code? I'm hesitant. Maybe yes, maybe no. If we concentrate on 32-bit float/integer values for now, x86_64 can pack 4, 8 or 16 of those 32-bit values in a 128, 256 or 512 bit SIMD register, using SSE, AVX or AVX-512. I don't believe in accommodating old crap like SSE, so that leaves us with 8 or 16 lanes of 32 bits each. (For 64-bit values, this would be 4 or 8, which complicates matters further, so let's ignore those for now.) One possibility would be to have native octafloat, octaint, hexafloat, hexaint types. Heck, in an extreme version of language design, we could even leave out the scalar float and int, so that the CPU would never have to switch values from the scalar register file to the SIMD register file, or vice versa. Do we need to accommodate byte access? Maybe not? Text characters haven't been 7-bit ASCII for a long time now. TBD.
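As a rough idea of what a native octafloat could mean, here is a sketch in plain C. Only the name octafloat comes from the text; the struct layout and the helpers octa_splat and octa_madd are assumptions for illustration. The lane loops stand in for what a real implementation would map to one 256-bit register operation.

```c
#define OCTA_LANES 8

/* Hypothetical 10Lang "octafloat": eight 32-bit floats, matching the
   width of one 256-bit SIMD register. */
typedef struct { float lane[OCTA_LANES]; } octafloat;

/* Broadcast a scalar into every lane: the "scalar as special case" path. */
static octafloat octa_splat(float v)
{
    octafloat r;
    for (int i = 0; i < OCTA_LANES; ++i) r.lane[i] = v;
    return r;
}

/* Lane-wise multiply-add: r = a * b + c, one result per lane. */
static octafloat octa_madd(octafloat a, octafloat b, octafloat c)
{
    octafloat r;
    for (int i = 0; i < OCTA_LANES; ++i)
        r.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
    return r;
}
```

A call such as octa_madd(octa_splat(2.0f), octa_splat(3.0f), octa_splat(1.0f)) yields 7.0f in all eight lanes. Whether the width (8) should be visible in the type name like this is exactly the open question above.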

Modern processors are complex monstrosities. As we don't want to stall on branching, CPUs make tremendous efforts in predicting branches. At what cost? At the cost of CPU vulnerabilities. Branches hinder efficiency. Instead of making them faster with prediction, why can't we just focus on reducing them instead?

Design Decision GAMMA
Conditional SIMD operations are explicit in code.
So, how do we reduce branch operations? Because memory is now so obscenely slow, often the fastest way to compute things conditionally, is not to branch, but to compute both the TRUE branch and the FALSE branch, and then conditionally select values on a per-SIMD lane basis. Processors have a construct for this. AVX calls this vector blend (as in VBLENDVPS), ARM NEON calls it vector bitwise select (as in VBSL.) The programmer using 10Lang should have explicit access to this construct, so that he may write guaranteed branch-free code if he so chooses. Writing branch-free SIMD code is how you end up with something that is crazy efficient, and just tears through the data and the computations.
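The compute-both-then-select idea can be sketched in plain C as follows. The octaint struct and the helpers octa_less, octa_select and double_or_wrap are hypothetical names for this example; the lane loops have a fixed trip count and stand in for single SIMD instructions, with octa_select playing the role of a vector blend or bitwise select.

```c
#include <stdint.h>

#define LANES 8

/* Hypothetical eight-lane integer type. */
typedef struct { int32_t lane[LANES]; } octaint;

/* Per-lane compare: all bits set where a < b, all bits clear elsewhere. */
static octaint octa_less(octaint a, octaint b)
{
    octaint m;
    for (int i = 0; i < LANES; ++i)
        m.lane[i] = -(int32_t)(a.lane[i] < b.lane[i]);
    return m;
}

/* Per-lane select: take t where the mask is set, f where it is clear. */
static octaint octa_select(octaint mask, octaint t, octaint f)
{
    octaint r;
    for (int i = 0; i < LANES; ++i)
        r.lane[i] = (t.lane[i] & mask.lane[i]) | (f.lane[i] & ~mask.lane[i]);
    return r;
}

/* Version of r = (x < limit) ? x * 2 : x - limit without a
   data-dependent if: both outcomes are computed for every lane,
   then the mask picks one of them per lane. */
static octaint double_or_wrap(octaint x, octaint limit)
{
    octaint t, f;
    for (int i = 0; i < LANES; ++i) {
        t.lane[i] = x.lane[i] * 2;               /* TRUE branch  */
        f.lane[i] = x.lane[i] - limit.lane[i];   /* FALSE branch */
    }
    return octa_select(octa_less(x, limit), t, f);
}
```

Whether a C compiler actually emits branch-free machine code for this is up to the compiler; the point of making select explicit in 10Lang would be to remove that uncertainty.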

With these three main design decisions, it should be possible to sketch out a programming language. So, a C-like language, but for arrays and SIMD hardware. And I wonder if it could be possible to implement a rudimentary prototype by just using a preprocessor. Just have the 10Lang code translated into plain C with the help of the immintrin.h header file.

But for now, it is feedback time. After reading this, it would be great if you could drop a note in the comments. A folly? An exercise worth pursuing? Let me know!

Thursday, June 14, 2018

Joystick sampling rate in games.

I investigated an interesting conundrum this morning: why was my game running so much differently on my iPad PRO? The tank was snappy, and turning aggressively on the iPad PRO, but not on Linux and Android.

The main difference between iPad PRO and other platforms is its higher display refresh. But I was certain I had this covered, as I step my simulation exactly the same on all platforms: with 1/120s steps. The only difference being, that I render after each step on iPad PRO and only render once after two sim steps on other platforms that have 1/60s display refresh.

First thing to do, is to rule out differences in the iOS port. When I force my iPad to render at 1/60s instead, the iPad behaviour reverts to the same as the Linux/Android ports. Confirmed: it is the display refresh rate that makes the difference, not the platform's architecture.

So why would these two scenarios have different outcomes?

[ sim 1/120s, render ] [ sim 1/120s, render ]

[ sim 1/120s,       sim 1/120s,      render ]
|                     |                     |
0ms                 8.3ms                 16.6ms

A logical explanation would be that I somehow influence the simulation somewhere, as I render. But after examining the code, nothing showed up.

It dawned on me that in the high display refresh case, the faster rendering is not the only difference. In 120Hz mode, you not only get more rendering activity, you also get more frequent input events. Touches come in faster when you render at 120Hz than they do when you render at 60Hz. Joystick changes and touch events are batched with display refresh.

To confirm this, I put in an artificial joystick value, that would simply rotate the joystick at a set pace. Then I adjusted how those joystick changes were relayed for a 60Hz display frame. The result is the video below.

On the right, I adjust the joystick angle by 0.10 radians before each sim step. On the left, I adjust the joystick angle only once for two steps, but by double the radians.

At 120Hz stick sampling, I get a smoother joystick signal. Even though the joystick rotation speed is the same, the 60Hz sampling shows more jarring deltas. I hadn't expected the effect on the simulation outcome to be this big.

The reason for the dramatic difference is that the small difference is amplified by the PID controllers I use in my game. In the case of low stick sampling rate, the PID controller will always see a zero change during the second step, and a large change in the first step. The PID controller can react a lot more effectively if it gets a higher frequency signal.
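A small sketch of the signal each sampling scheme hands to the controller. The function stick_deltas and its parameters are hypothetical; it only models the artificial rotating stick from the experiment, not the game's PID code.

```c
#include <assert.h>

/* Records the change in stick angle that each sim step gets to see.
   steps_per_read = 1: the stick is updated before every sim step.
   steps_per_read = 2: the stick is updated once per two sim steps,
                       by twice the amount, as with 60Hz input. */
void stick_deltas(float rad_per_step, int steps, int steps_per_read,
                  float *delta)
{
    assert(steps_per_read > 0);
    float angle = 0.0f;
    for (int s = 0; s < steps; ++s) {
        float prev = angle;
        if (s % steps_per_read == 0)
            angle += rad_per_step * (float)steps_per_read;
        delta[s] = angle - prev;   /* what a derivative term reacts to */
    }
}
```

With 0.10 radians per step, reading before every step gives a delta of about 0.10 each time, while reading every second step gives about 0.20, 0, 0.20, 0. The average rotation speed is identical, but the second signal is a staircase.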

Lesson learned: these two scenarios give different simulation outcomes:

[ read stick,     sim 1/120s,     sim 1/120s,     render ]

[ read stick, sim 1/120s, read stick, sim 1/120s, render ]
|                                                        |
0ms                                                    16.6ms

Although forcing the 120Hz stick signal down to 60Hz is simple to achieve, it will be hard to provide a 120Hz stick signal if you only get your events at 60Hz. So the sweet, reactive control on iPad PRO is hard to achieve on 60Hz devices, unless you interpolate or extrapolate the stick values.
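One way the interpolation could look, as a sketch: ramp linearly from the previous 60Hz sample to the current one across the sim steps of the frame. The Stick type and both function names are assumptions, and linear interpolation is just the simplest choice, not something I have tuned.

```c
#include <assert.h>

typedef struct { float x, y; } Stick;

/* Linear interpolation between two stick samples, t in [0, 1]. */
Stick stick_lerp(Stick a, Stick b, float t)
{
    Stick r = { a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t };
    return r;
}

/* Produces one stick value per sim step for a frame that only received
   a single new sample. The last step lands exactly on the new sample. */
void stick_substeps(Stick prev, Stick cur, int steps, Stick *out)
{
    assert(steps > 0);
    for (int s = 0; s < steps; ++s)
        out[s] = stick_lerp(prev, cur, (float)(s + 1) / (float)steps);
}
```

For two steps per frame, the first sim step sees the halfway value and the second sees the new sample, so the controller no longer gets a large jump followed by zero change.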

Friday, May 18, 2018

Differential Steering.

I just did a fun little exercise to figure out the steering in my Flank That Tank! indiegame. Of course, the best way to steer a tank is by using two throttle levers, one for each track. This will let the tank driver directly control the differential steering of the tank. It also enables some pretty exciting and wild maneuvers.

So really, case closed. A gamepad typically has two analog joysticks, one joystick for each track. Done!

However, I want to be able to run this game on mobile platforms using touch. And I tried it, but the lack of physical stops really hampers the feel of driving. Two levers on a touch screen simply are no proper substitute. (Not to speak of controlling two levers and a fire button, which is even harder on a touch screen.)

So no levers on a touch screen. Could we perhaps do differential steering with a single touch? Or, quite similarly: do differential steering with a single joystick?

It's fun to figure it out.

The stick at 12 o'clock would mean full steam ahead, so the L and R tracks at +100% power.
The stick at 06 o'clock would mean full steam backwards, so the L and R tracks at -100% power.
The stick at 03 o'clock would mean a hard right turn, in place, so L at +100% and R at -100% power.
The stick at 09 o'clock would mean a hard left turn, in place, so L at -100% and R at +100% power.

For the intermediate joystick positions, interpolating these four settings is all that is required.
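One possible interpolation, sketched in C: mix the two stick axes linearly and clamp. The names TrackPower and differential_steer are made up, and the clamped linear mix is an assumption; other blends between the four settings would work too.

```c
typedef struct { float left, right; } TrackPower;

/* Clamp to the -100% .. +100% power range. */
static float clamp1(float v)
{
    return v < -1.0f ? -1.0f : (v > 1.0f ? 1.0f : v);
}

/* x: stick deflection to the right (+1 at 3 o'clock, -1 at 9 o'clock).
   y: stick deflection forward    (+1 at 12 o'clock, -1 at 6 o'clock).
   Returns track power in -1 .. +1, where +1 means +100%. */
TrackPower differential_steer(float x, float y)
{
    TrackPower p = { clamp1(y + x), clamp1(y - x) };
    return p;
}
```

This reproduces the four settings listed above exactly, and a stick pushed halfway up and halfway right (0.5, 0.5) gives L at +100% and R at 0%: a forward turn to the right around the right track.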

This ought to work nicely for touch screens. Left thumb to drive the tank, which leaves the right hand free for tapping the screen to shoot, and possibly aim the turret as well.

So I'll be implementing this scheme shortly. That leaves me to consider the issue of absolute/relative control. Some people can't steer a vehicle that drives towards the camera (or in 2D: to the bottom of the screen) as it reverses L/R from the driver's point of view. So I may implement an absolute system as well: the "12 o'clock position" will adapt to where the tank is pointing.

Friday, May 11, 2018

The curious case of FPS jitter.

I tried to record a video of my game this morning, and it bugged me that it wasn't 100% smooth. The game did report 60fps though, so let's find out what is going on.

First order of business, is to graph the delta-time for each frame, instead of just reporting the frames-per-second. And sure enough, I would see jitter in the signal: a slow frame followed by a fast frame.
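Why the fps counter hides this can be shown with a few lines of C. The helpers frame_deltas and average_fps are hypothetical names for this sketch; the timestamps would come from whatever clock the game already uses.

```c
#include <assert.h>
#include <stddef.h>

/* Converts frame timestamps (in seconds) into per-frame delta times.
   Writes count-1 deltas and returns the largest one. */
double frame_deltas(const double *stamp, size_t count, double *delta)
{
    assert(count >= 2);
    double worst = 0.0;
    for (size_t i = 1; i < count; ++i) {
        delta[i - 1] = stamp[i] - stamp[i - 1];
        if (delta[i - 1] > worst) worst = delta[i - 1];
    }
    return worst;
}

/* Average frames per second over the same timestamps. */
double average_fps(const double *stamp, size_t count)
{
    assert(count >= 2);
    return (double)(count - 1) / (stamp[count - 1] - stamp[0]);
}
```

A slow frame followed by an equally fast frame leaves the average untouched, so average_fps keeps reporting the same number while frame_deltas shows the spike.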

What was really puzzling, was that I could induce this jitter by pressing a key on the keyboard. Even if this key has no game functionality behind it, the frame time would jitter: slow+fast, for each and every press. And also for the auto-repeat events.

In the picture above, the three red lines are at values 1/60, 1/30 and 1/20 seconds. The green marks are the measured frame times. The jitter shows up for every key I press.

So, perf and FlameGraph to the rescue.

To my surprise, I notice that SDL_IBus_UpdateTextRect() shows up in the profile. Why is SDL updating text rectangles? I'm not doing any text related things. I just render to an OpenGL window. Notice how a single key press leads to an avalanche of computation and communication, with a call depth of no less than 34 functions!

Frogtoss told me to look into SDL's Text Input system. My code never started a text input cycle, but to be sure, I called SDL_IsTextInputActive() to check. And sure enough: Text Input is active by default! Adding a SDL_StopTextInput() fixed the jitter.

Judging from the flamegraph, a key press when Text Input is active, is incredibly costly, as it involves computation, communication with the X server, polling, waking up stuff, and more. An avalanche of IO happens for every press. So for games, it's best to turn it off as soon as you have initialized SDL.

Executive Summary: after launching your SDL2 based game, call SDL_StopTextInput() for a smoother frame rate.

 // Right after SDL is initialized: make sure the Text Input system is off.
 if ( SDL_IsTextInputActive() )
     SDL_StopTextInput();

Postscript: I will try recording that video again. If it still isn't smooth, at least it is not because of this.

Test specs:
Ubuntu 18.04
SDL 2.0.8

Tuesday, May 8, 2018

GDPR and iOS developers.

Two of my mobile games on iOS use AdMob to serve ads. This makes me vulnerable to EU General Data Protection fines, as I have no clear view on what exactly is collected by AdMob. Google puts the responsibility of requesting user permission for the data that AdMob collects on me.

The safest option at this time, is for me to completely stop serving ads. And I may very well end up using that option. But I thought it may be interesting to examine other options.

Let's start with disabling all ads, but only for European customers. How feasible would this be?

Well, it starts with the ill-defined term "European customer." We need to identify exactly who the GDPR applies to. This is what the EU has to say about that:

It applies to all companies processing and holding the personal data of data subjects residing in the European Union, regardless of the company’s location.

Still imprecise, because it says nothing about the subject's location, other than residence. What about an EU citizen on holiday in the US? What about a US citizen on holiday in the EU? For now, let's ignore this, and just try to determine residence.

One way, would be to check the user's locale. Typically, it would be set to the country of residence. So the use of NSLocale would be a good start. Better than the alternative of actually checking the user's location, as that would first throw up an annoying dialog requesting permission for checking location.

Is that 100% foolproof? What if users have their locale set up incorrectly? Let me guess, the onus is on me? Hmm... completely disabling ads seems safer indeed.

Ok, disabling ads completely. Is it as simple as going to the AdMob portal and stopping the ad serving? Unfortunately not, because AdMob would still be active on the mobile device, and only after contacting the AdMob servers would it learn that there are no ads to serve. And who's to say the user profile hasn't already been sent to the AdMob servers before it learns there are no ads to show?

So nope, disabling ads can't be done without building and uploading new versions to the app store.

One final remark: those GDPR tools that AdMob talks about? Not there!

Wednesday, April 25, 2018

Leaving Track Prints

I am currently developing my, still unnamed, indiegame. This game is a 2D top down tank fight. And its main gimmick is the 100% destructible world.

Nice destruction if I say so myself, but notice that the tanks don't leave track prints. Today, I will write about implementing track prints that the tanks leave behind on the terrain as they drive over it.

The first observation to be made here, is that there are a lot of them, for each tank. That means generating and rendering thousands of them, if not tens of thousands or even hundreds of thousands. This immediately tells me that they can't be rendered individually. I need to apply a technique called Instanced Rendering.

Rendering

In instanced rendering, all instances share the model vertex data and have some per-instance data to make them unique. This per-instance data is typically a transformation matrix, but can also include other things like colour if need be. In my case, the per-instance data can be particularly compact because I work in two dimensions.

All the prints will be identical, except for two things: their position in the world, and their orientation. So in theory, three values would be enough: an x and y coordinate, plus a rotation angle. But personally, I find that defining rotation with a vector, like Chipmunk2D does, is more elegant. Hence, I will feed OpenGL a 2D vector for position and 2D vector for orientation.
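As a sketch, the per-instance data and the transform it implies could look like this in C. The names Vec2, PrintInstance and print_transform are assumptions for illustration; in the real renderer the transform would run in the vertex shader, not on the CPU.

```c
typedef struct { float x, y; } Vec2;

/* Per-instance data for one track print: four floats in total. */
typedef struct {
    Vec2 pos;   /* position in the world */
    Vec2 dir;   /* unit vector (cos a, sin a) encoding the orientation */
} PrintInstance;

/* Rotates a model-space vertex by the instance's direction vector,
   then translates it to the instance's position. No sin/cos needed
   per vertex, which is one nice property of the vector form. */
Vec2 print_transform(PrintInstance inst, Vec2 v)
{
    Vec2 r = {
        inst.pos.x + v.x * inst.dir.x - v.y * inst.dir.y,
        inst.pos.y + v.x * inst.dir.y + v.y * inst.dir.x
    };
    return r;
}
```

With dir set to (0, 1), a quarter turn counter-clockwise, the model vertex (1, 0) of a print at (10, 20) ends up at (10, 21).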

The next thing to consider is the life-time of the prints. If we create a new print at frame N, then we will need to render it at that frame N and all other frames after it, up until frame M (M much larger than N), where we need to evict this print to make space. After all, we don't want to run out of resources by creating arbitrarily many prints.

The fact that I progressively create the prints, and reclaim resources for the oldest one, leads me to the convenient solution of ring-buffers. We create a Vertex Buffer Object to hold the shared model data plus N instances. When creating instance N+1, we will reuse the slot at position 0. Each frame, we will only write the VBO at the slots that got new data that same frame.
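A minimal CPU-side sketch of such a ring buffer in C. All names here (PrintRing, ring_add, ring_dirty_spans, MAX_PRINTS) are hypothetical, and the capacity is tiny on purpose. The spans it reports are the slot ranges that would be written to the VBO that frame, for instance with one glBufferSubData call per span.

```c
#include <stddef.h>

#define MAX_PRINTS 4   /* tiny for illustration; think thousands */

typedef struct { float x, y; } Vec2;
typedef struct { Vec2 pos; Vec2 dir; } PrintInstance;

typedef struct {
    PrintInstance slot[MAX_PRINTS]; /* CPU copy of the instance region  */
    size_t next;                    /* slot the next print goes into    */
    size_t live;                    /* instances to draw, <= MAX_PRINTS */
    size_t dirty_first;             /* first slot written this frame    */
    size_t dirty_count;             /* number of slots written          */
} PrintRing;

/* Call once at the start of each frame. */
void ring_begin_frame(PrintRing *r)
{
    r->dirty_first = r->next;
    r->dirty_count = 0;
}

/* Adds a print, overwriting the oldest one when the ring is full. */
void ring_add(PrintRing *r, PrintInstance p)
{
    r->slot[r->next] = p;
    r->next = (r->next + 1) % MAX_PRINTS;
    if (r->live < MAX_PRINTS) r->live++;
    if (r->dirty_count < MAX_PRINTS) r->dirty_count++;
}

/* Splits this frame's writes into at most two contiguous slot ranges,
   because the dirty range may wrap around the end of the buffer.
   Returns the number of ranges. */
int ring_dirty_spans(const PrintRing *r, size_t first[2], size_t count[2])
{
    if (r->dirty_count == 0) return 0;
    size_t tail = MAX_PRINTS - r->dirty_first;   /* room before the wrap */
    first[0] = r->dirty_first;
    if (r->dirty_count <= tail) {
        count[0] = r->dirty_count;
        return 1;
    }
    count[0] = tail;
    first[1] = 0;
    count[1] = r->dirty_count - tail;
    return 2;
}
```

Because every instance is drawn the same way, the draw order does not matter and the renderer can simply draw the first `live` slots, regardless of where the write position currently is.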

Generating

Having the rendering covered, leaves me the problem of generating the prints. This problem is trickier. The tank has many track-segments touching the ground at any time, all leaving a print. When the tank drives straight, those prints all superimpose, so you would only really need to generate one of them. But when the tank turns, this won't work, and gets worse if it turns-in-place. See below what happens if you leave one print at each side of the tank. The tracks look fine, until the tank does a 180 degrees spin.

And it looks particularly bad if the tank gets bumped hard and moves sideways. I haven't really cracked the problem of generating proper tracks yet. I think the root of the problem lies in the fact that the game's simulation has no concept of the track links. The tank itself is just four rigid bodies: one chassis, one turret and two for the L/R tracks. The links of the tracks are just an animation effect.

So the generation of track prints needs some more work. I'll report back when and if I solve it.

Tuesday, February 20, 2018

Returning to iOS development.

It occurred to me that the new iPad Pro 120Hz display is a great motivation to update my Little Crane game for iOS. So after a long time, I returned to my iOS codebase. Here I report some random findings.

🔴 OS Support
Currently, Little Crane supports iOS3.2 and up. But the current Xcode (9.2) does not support anything under iOS8. Oh well, abandoning a few old devices then.

🔴 Launch Image
Also scrapped by iOS: Launch Images. If you want to have support for iPad Pro, you now need newfangled Launch Screen storyboards. As more iOS devices got released, the launching process got more complex over time:

First, they were just specially named images in your bundle.
Then, they were images in an Asset Catalog.
Now, they are a storyboard with a whole lot of crap that comes with this. Oh boy.

🔴 Bloated AdMob
Scrapped a long time ago was the iAd product. So if you want to have ads in your app, you need to look elsewhere. I went with the other behemoth in advertisements: AdMob. When upgrading from AdMob SDK 7.6.0 to 7.28.0 I was unpleasantly surprised. I now need to link to a whole bunch of extra stuff. I think ads do 3D rendering now, as opposed to just playing a video? New dependencies in AdMob:

GL Kit
Core Motion
Core Video
CFNetwork
Mobile Core Services

🔴 GKLeaderboardViewControllerDelegate
Leaderboards with a delegate have been deprecated. It probably still works, so I am tempted to leave in the old code. I do get this weird runtime error message when closing a Game Center dialog though: "yowza! restored status bar too many times!"

Tuesday, February 13, 2018

Flame Graphs and Data Wrangling.

In my pursuit of doing Real Time (60fps) Ray Tracing for a game, I have been doing a lot of profiling with 'perf.' One way to quickly analyse the results from a perf record run, is by making a FlameGraph. Here's a graph for my ray tracing system:


During my optimization effort, I've found that lining up all the data nicely for consumption by your algorithm works wonders. Have everything ready to go, and blast through it with your SIMD units. For ray tracing, this means having your intersection routines blast through the data, as ray tracing in its core, is testing rays versus shapes. In my game, these shapes are all AABBs, and my intersection code tests 8 AABBs versus a single ray in one go. A big contribution to hitting 60fps ray tracing is the fact that my scenes use simple geometry: AABBs, almost as simple as spheres, but more practical for world building.
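To show what "8 AABBs versus a single ray" means in terms of data layout, here is a sketch of the common slab test in plain C. It is not the game's actual routine: the names Box8, Ray and ray_vs_box8_sketch are made up, and the loop over lanes stands in for AVX2 instructions that process all eight boxes at once.

```c
#define BOX_LANES 8

/* Eight AABBs in SoA layout: one array per bound, one lane per box. */
typedef struct {
    float min_x[BOX_LANES], min_y[BOX_LANES], min_z[BOX_LANES];
    float max_x[BOX_LANES], max_y[BOX_LANES], max_z[BOX_LANES];
} Box8;

typedef struct {
    float ox, oy, oz;   /* ray origin */
    float ix, iy, iz;   /* 1 / direction, precomputed once per ray */
} Ray;

static float minf(float a, float b) { return a < b ? a : b; }
static float maxf(float a, float b) { return a > b ? a : b; }

/* Slab test of one ray against eight boxes. Returns a bit mask with
   bit i set when the ray enters box i at some t >= 0.
   Caveat: direction components of exactly zero need extra care,
   which this sketch leaves out. */
unsigned ray_vs_box8_sketch(const Ray *r, const Box8 *b)
{
    unsigned hits = 0;
    for (int i = 0; i < BOX_LANES; ++i) {
        float tx0 = (b->min_x[i] - r->ox) * r->ix;
        float tx1 = (b->max_x[i] - r->ox) * r->ix;
        float ty0 = (b->min_y[i] - r->oy) * r->iy;
        float ty1 = (b->max_y[i] - r->oy) * r->iy;
        float tz0 = (b->min_z[i] - r->oz) * r->iz;
        float tz1 = (b->max_z[i] - r->oz) * r->iz;
        /* Latest entry and earliest exit over the three slabs. */
        float tnear = maxf(maxf(minf(tx0, tx1), minf(ty0, ty1)), minf(tz0, tz1));
        float tfar  = minf(minf(maxf(tx0, tx1), maxf(ty0, ty1)), maxf(tz0, tz1));
        hits |= (unsigned)(tfar >= maxf(tnear, 0.0f)) << i;
    }
    return hits;
}
```

Note that the test itself is only a handful of multiplies, mins and maxes per box. Getting the right eight boxes into a Box8 for each ray is the data wrangling that the rest of this post is about.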

This is all fine and dandy, but does expose a new problem: your CPU is busy more with wrangling the data than doing the actual computation. Even when I cache the paths that primary rays take (from camera into scene) for quick reuse, the administration around intersection tests takes up more time than the tests themselves.

This is visible in the graph above, where the actual tests are in linesegment_vs_box8 (for shadow rays) and ray_vs_box8 (for primary rays.) It seems to be some wall I am hitting, and I am having a hard time pushing through it for even more performance.

So my shadow rays are more costly than my primary rays. I have a fixed camera position, so the primary rays traverse the world grid in the same fashion each frame. This, I exploit. But shadow rays go all over the place, of course, and need to dynamically march through my grid.

In order to alleviate the strain on the CPU a bit, I cut the number of shadow rays in half, by only computing shadow once for two frames, for each pixel. So half the shadow information lags by one frame.

So to conclude: if you line up all your geometry beforehand, and have it packed in sets of 8, then the actual intersection tests take almost no time at all. This makes it possible to do real time ray tracing at an 800x400 resolution, at 60 frames per second, at 1.5 rays per pixel on 4 cores equipped with AVX2. To go faster than that, I need to find a way to accelerate the data-wrangling.

Friday, January 5, 2018

2017 Totals

So, The Little Crane That Could is waning. Here are the 2017 results (number of free downloads.) It did manage to surpass 19M lifetime downloads.

Platform   2017    2016    2015    2014    2013    2012    2011
iOS        191K    416K    630K    1300K   3199K   3454K   1550K
Android    1100K   1515K   1525K   825K    1579K   1656K   -
Mac                10K     20K     30K     53K     81K     -
OUYA       -       -       0K      4K      15K     -       -
Kindle     9K      48K     52K     46K     95K     -       -
Rasp Pi    -       -       ?       ?       6K      -       -