Wednesday, December 26, 2018

Designing a Programming Language: 10Lang.

I have been programming since 1982, which I guess shows my age. But in those 36 years, I have never designed nor implemented a programming language. The current anti-C++ sentiment on Twitter made me ponder what a better language would look like. So let's take the first tiny steps towards eliminating that "design a programming language" bucket-list item.

Let's start with what people jokingly call the hardest part, but which really is the easiest part: the name. As I want my hypothetical language to succeed the prince of all programming languages, C, it only follows that I should name it as a successor to C. Seeing that C++ and D are already taken, I see no other option than to call it 10. Or, for better Googlability: 10Lang.

If you look at C, you can kind of see that C code is pretty closely modeled after the hardware. Things in the language often have a direct representation in transistors. And if they don't, they are at least not too remote from the silicon, unlike more complex and more abstract languages.

What would a good programming model be for today's hardware? For that, let's just look at the main differences between a 1970s processor and a 2018 processor.

One big difference is that today's CPU has relatively slow memory. And by slow, I mean very slow. The CPU has to wait an eternity for data that is not in a register or a cache. This speed discrepancy means that today's CPU can be crippled by things as simple as branches, virtual function calls, or irregular data access. What could we do to lead the programmer's mind away from OOP or Array-of-Structures (AoS) thinking, and naturally guide her into the Structure-of-Arrays (SoA) approach?
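For reference, here is what that distinction looks like in plain C (the particle struct is just an illustrative example, not anything 10Lang-specific):

    /* Array-of-Structures: each particle's fields are interleaved in memory. */
    struct particle_aos { float x, y, z, mass; };
    struct particle_aos particles_aos[1024];

    /* Structure-of-Arrays: each field gets its own contiguous array, so a
       loop that touches only 'x' streams through memory without gaps. */
    struct particles_soa {
        float x[1024];
        float y[1024];
        float z[1024];
        float mass[1024];
    };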

Design Decision ALPHA
Arrays come first, scalars second.
If we want our code to be efficient, we need to make sure the CPU can efficiently perform batch operations. The one-off operations can be slower; they are not important from a performance viewpoint. By default, operations should be performed on a range, and doing operations on a scalar should be the special case. Maybe by treating it as an array of size 1, maybe by giving it a more verbose syntax; we'll see what works later.
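In plain C terms, the idea might look something like this; the function names are mine and purely illustrative:

    #include <stddef.h>

    /* Batch operation: scale a whole range of values at once.
       This is the kind of operation 10Lang would make the default. */
    void scale(float *values, size_t count, float factor)
    {
        for (size_t i = 0; i < count; ++i)
            values[i] *= factor;
    }

    /* The scalar case is just the degenerate range of size 1. */
    void scale_one(float *value, float factor)
    {
        scale(value, 1, factor);
    }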

The next big difference between that 1970s processor and today's is the SIMD unit. It's probably one of the most distinctive features of a processor nowadays, and it dictates what the register file looks like. So, if we are going to model the programming language after the transistors, then there really are no two ways about it...

Design Decision BETA
SIMD is a first class citizen in the language.
I haven't figured out how to approach this specifically, yet. First, there is the overlap with design decision ALPHA: SIMD really is kind of like an in-register array. Next, there is the consideration of whether to let the register width seep through into the language. The C programming language was always pretty vague about how many bits there should be in a char, short or int value. I'm not sure that was helpful in the long run. But there are int8_t, int16_t and int32_t now, of course. Would our 10Lang benefit from explicit SIMD widths in the code? I'm hesitant. Maybe yes, maybe no.

If we concentrate on 32-bit float/integer values for now, x86_64 can pack 4, 8 or 16 of those 32-bit values in a 128, 256 or 512 bit SIMD register, using SSE, AVX or AVX-512. I don't believe in accommodating old crap like SSE, so that leaves us with 8 or 16 lanes of 32 bits each. (For 64-bit values, this would be 4 or 8 lanes, which complicates matters further, so let's ignore those for now.) One possibility would be to have native octafloat, octaint, hexafloat and hexaint types. Heck, in an extreme version of language design, we could even leave out the scalar float and int, so that the CPU would never have to move values between the scalar register file and the SIMD register file. Do we need to accommodate byte access? Maybe not? Text characters haven't been 7-bit ASCII for a long time now. TBD.
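Just to make the 8-lane case tangible: a hypothetical octafloat could simply map onto the existing 256-bit AVX register type, roughly like this (octafloat and octafloat_add are names I'm inventing on the spot):

    #include <immintrin.h>

    /* One possible mapping of a hypothetical 'octafloat' onto x86_64:
       eight 32-bit floats in a single AVX register. */
    typedef __m256 octafloat;

    /* An 'octafloat' add compiles down to a single VADDPS instruction. */
    static inline octafloat octafloat_add(octafloat a, octafloat b)
    {
        return _mm256_add_ps(a, b);
    }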

Modern processors are complex monstrosities. As we don't want to stall on branching, CPUs make tremendous efforts to predict branches. At what cost? At the cost of CPU vulnerabilities. Branches hinder efficiency. Instead of making them faster with prediction, why not focus on reducing them?

Design Decision GAMMA
Conditional SIMD operations are explicit in code.
So, how do we reduce branch operations? Because memory is now so obscenely slow, often the fastest way to compute things conditionally is not to branch, but to compute both the TRUE branch and the FALSE branch, and then conditionally select values on a per-SIMD-lane basis. Processors have a construct for this. AVX calls it vector blend (as in VBLENDVPS), ARM NEON calls it vector bitwise select (as in VBSL). The programmer using 10Lang should have explicit access to this construct, so that he may write guaranteed branch-free code if he so chooses. Writing branch-free SIMD code is how you end up with something that is crazy efficient and just tears through the data and the computations.
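Here is a small AVX sketch of that compute-both-then-select pattern in C; the specific computation is made up, the point is the compare followed by the blend:

    #include <immintrin.h>

    /* Compute both sides, then select per lane instead of branching:
       result[i] = (x[i] < 0) ? (x[i] * 2) : (x[i] + 1)   for all 8 lanes. */
    static inline __m256 select_example(__m256 x)
    {
        __m256 if_true  = _mm256_mul_ps(x, _mm256_set1_ps(2.0f)); /* TRUE branch  */
        __m256 if_false = _mm256_add_ps(x, _mm256_set1_ps(1.0f)); /* FALSE branch */
        __m256 mask     = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_LT_OQ);
        return _mm256_blendv_ps(if_false, if_true, mask); /* VBLENDVPS per lane */
    }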

With these three main design decisions, it should be possible to sketch out a programming language: a C-like language, but for arrays and SIMD hardware. And I wonder whether it would be possible to implement a rudimentary prototype using just a preprocessor: have the 10Lang code translated into plain C with the help of the immintrin.h header file.
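Purely as a thought experiment, the preprocessor might turn a hypothetical 10Lang statement like b = a * 3.0 over a float range into C along these lines (assuming, for simplicity, a range length that is a multiple of 8):

    #include <immintrin.h>
    #include <stddef.h>

    /* Hypothetical 10Lang source:   b = a * 3.0;   (a, b are float ranges)
       One possible C translation the preprocessor could emit: */
    void emitted_scale(const float *a, float *b, size_t count)
    {
        const __m256 factor = _mm256_set1_ps(3.0f);
        for (size_t i = 0; i < count; i += 8)
            _mm256_storeu_ps(b + i, _mm256_mul_ps(_mm256_loadu_ps(a + i), factor));
    }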

But for now, it is feedback time. After reading this, it would be great if you could drop a note in the comments. A folly? An exercise worth pursuing? Let me know!