"Node pairing" (update: it's actually newtype-ish construct) in top .NET entries for binary-trees halves number of allocations

In discussion for proposal #391 (closed) there is some arguing whether top C# and F# entries for binary-trees halve the amount of allocations or not. In the description https://benchmarksgame-team.pages.debian.net/benchmarksgame/description/binarytrees.html#binarytrees we're told to: Don't optimize away the work.

I don't have .NET installation locally (update: I've downloaded it and made experiments with C#), but I've translated C# # 6 to Java with primitive objects (i.e. I've built Java from Project Valhalla from OpenJDK). First, some theory to show my version is equivalent to C# version: What's the difference between struct and class in .NET?

The general difference is that a reference type lives on the heap, and a value type lives inline, that is, wherever it is your variable or field is defined.

A variable containing a value type contains the entire value type value. For a struct, that means that the variable contains the entire struct, with all its fields.

A variable containing a reference type contains a pointer, or a reference to somewhere else in memory where the actual value resides.

This has one benefit, to begin with:

value types always contains a value

reference types can contain a null-reference, meaning that they don't refer to anything at all at the moment

Comparing to terminology from Project Valhalla (recent terminology example is here: https://openjdk.java.net/jeps/401 ) we see that reference types in C# correspond to identity classes in Java and value types in C# correspond to primitive classes in Java (naming can change in future, but that's what Project Valhalla defines now). The difference is that primitive classes in Java are shallowly immutable, while structs in C# (which are value types) can be mutable. Otherwise the distinction is the same in both languages: value types in C# and primitive classes in Java are inlined into other types, while reference types in C# and identity classes in Java always result in heap allocation (at least semantically).

Arguing whether the number of allocations is halved or not is somewhat pointless since we can just measure it. I've attached two binary-trees solutions in Java. One that is just binary-trees Java # 7 with added allocation counter. Second one is written using Java with primitive objects support and also has allocation counter (I've build OpenJDK myself using instructions from here: https://github.com/openjdk/valhalla/blob/lworld/doc/building.md - of course the initial git command is git clone https://github.com/openjdk/valhalla.git). Running them produces following output:

piotrek@piotrek-desktop:~/projekty/benchmarks-game-java/src/main/java$ java binarytrees_7 21
stretch tree of depth 22	 check: 8388607
2097152	 trees of depth 4	 check: 65011712
524288	 trees of depth 6	 check: 66584576
131072	 trees of depth 8	 check: 66977792
32768	 trees of depth 10	 check: 67076096
8192	 trees of depth 12	 check: 67100672
2048	 trees of depth 14	 check: 67106816
512	 trees of depth 16	 check: 67108352
128	 trees of depth 18	 check: 67108736
32	 trees of depth 20	 check: 67108832
long lived tree of depth 21	 check: 4194303
Total allocations count: 613766494
piotrek@piotrek-desktop:~/projekty/benchmarks-game-java/src/main/java$ /dev/shm/valhalla/build/linux-x86_64-server-release/jdk/bin/java binarytrees_valhalla 21
stretch tree of depth 22	 check: 8388607
2097152	 trees of depth 4	 check: 65011712
524288	 trees of depth 6	 check: 66584576
131072	 trees of depth 8	 check: 66977792
32768	 trees of depth 10	 check: 67076096
8192	 trees of depth 12	 check: 67100672
2048	 trees of depth 14	 check: 67106816
512	 trees of depth 16	 check: 67108352
128	 trees of depth 18	 check: 67108736
32	 trees of depth 20	 check: 67108832
long lived tree of depth 21	 check: 4194303
Total allocations count: 305485150

Allocation count is visibly halved.

After removing (costly) allocation counter (that also causes output to break benchmark rules), the timings are as follows:

piotrek@piotrek-desktop:~/projekty/benchmarks-game-java/src/main/java$ time java binarytrees_7 21
stretch tree of depth 22	 check: 8388607
2097152	 trees of depth 4	 check: 65011712
524288	 trees of depth 6	 check: 66584576
131072	 trees of depth 8	 check: 66977792
32768	 trees of depth 10	 check: 67076096
8192	 trees of depth 12	 check: 67100672
2048	 trees of depth 14	 check: 67106816
512	 trees of depth 16	 check: 67108352
128	 trees of depth 18	 check: 67108736
32	 trees of depth 20	 check: 67108832
long lived tree of depth 21	 check: 4194303

real	0m2,394s
user	0m7,065s
sys	0m0,729s
piotrek@piotrek-desktop:~/projekty/benchmarks-game-java/src/main/java$ time /dev/shm/valhalla/build/linux-x86_64-server-release/jdk/bin/java binarytrees_7 21
stretch tree of depth 22	 check: 8388607
2097152	 trees of depth 4	 check: 65011712
524288	 trees of depth 6	 check: 66584576
131072	 trees of depth 8	 check: 66977792
32768	 trees of depth 10	 check: 67076096
8192	 trees of depth 12	 check: 67100672
2048	 trees of depth 14	 check: 67106816
512	 trees of depth 16	 check: 67108352
128	 trees of depth 18	 check: 67108736
32	 trees of depth 20	 check: 67108832
long lived tree of depth 21	 check: 4194303

real	0m2,467s
user	0m7,231s
sys	0m0,734s
piotrek@piotrek-desktop:~/projekty/benchmarks-game-java/src/main/java$ time /dev/shm/valhalla/build/linux-x86_64-server-release/jdk/bin/java binarytrees_valhalla 21
stretch tree of depth 22	 check: 8388607
2097152	 trees of depth 4	 check: 65011712
524288	 trees of depth 6	 check: 66584576
131072	 trees of depth 8	 check: 66977792
32768	 trees of depth 10	 check: 67076096
8192	 trees of depth 12	 check: 67100672
2048	 trees of depth 14	 check: 67106816
512	 trees of depth 16	 check: 67108352
128	 trees of depth 18	 check: 67108736
32	 trees of depth 20	 check: 67108832
long lived tree of depth 21	 check: 4194303

real	0m1,545s
user	0m3,376s
sys	0m0,836s

So it's visible that my OpenJDK build has the same performance on vanilla "binary-trees Java # 7 program" as official JDK 11 build, but on primitive objects it's running much faster (60% faster?).

The final question is: is such "node pairing" an allowed optimization or not? Benchmark description states "don't optimize away the work", while "node pairing" reduces stress on memory (de)allocation system by half.

If "node pairing" is an acceptable optimization then e.g. why not go a little bit further, for big tree composed of mini trees (where a mini tree is a single object from GC point of view, but contains many logical nodes)?

binarytrees_valhalla.java is based both on "binary-trees Java # 7 program" (parallelization and I/O) and "binary-trees C# .NET # 6 program" (tree nodes implementation).

binarytrees_7.java binarytrees_valhalla.java

PS: "node pairing" may be a misleading term. It was used in the original issue mentioned at the top of this issue. Maybe the correct term would be "sandwiching" as the original optimization "sandwiches" value types between reference types and the end effect is reduction of instances of reference types.

Update:

What's wrong with binary-trees C# .NET #6 program and also binary-trees F# .NET #5 program is that they count a (close to?) zero-overhead reference wrapper as a tree node, while they misleadingly name the true binary tree node (i.e. the one that actually has two children) as "Next".

binary-trees C# .NET #6 program contains:

    // that's just a (close to?) zero-overhead wrapper over a reference
    struct TreeNode
    {
        // that's the true binary tree node
        class Next { public TreeNode left, right; }
        readonly Next next;
        ...
    }

binary-trees F# .NET #5 program contains:

type Next = Next of Tree * Tree // that's the true binary tree node
and [<Struct>] Tree = Tree of Next // that's just a (close to?) zero-overhead wrapper over a reference

That zero-overhead wrapper looks a lot like a newtype pattern from Haskell: https://wiki.haskell.org/Newtype

The restriction to one constructor with one field means that the new type and the type of the field are in direct correspondence:
State :: (s -> (s, a)) -> State s a
runState :: State s a -> (s -> (s, a))
or in mathematical terms they are isomorphic. This means that after the type is checked at compile time, at run time the two types can be treated essentially the same, without the overhead or indirection normally associated with a data constructor. So if you want to declare different type class instances for a particular type, or want to make a type abstract, you can wrap it in a newtype and it'll be considered distinct to the type-checker, but identical at runtime. You can then use all sorts of deep trickery like phantom or recursive types without worrying about GHC shuffling buckets of bytes for no reason.

It's also present in Rust: https://rust-unofficial.github.io/patterns/patterns/behavioural/newtype.html

Newtypes are a zero-cost abstraction - there is no runtime overhead.

All mentioned languages can implement the "trick" that current top ranked C# and F# programs use. But the trick makes that program actually invalid as it reduces the number of true binary tree nodes by half and instead just wraps the references (including null references) with a (memory-wise) zero-overhead wrapper.

I have demo programs that show the (close to?) zero-overhead wrapper actually has (close to?) no overhead. They are attached in this comment: #427 (comment 233741)

Update 2:

After all the discussions here, I propose to add following rule to existing benchmark rules:

trees must be allocated on the heap, even when depth == 0 (i.e. then it's just a root node with no children)

Such rule (in combination with the existing ones, i.e. Leaf nodes must be the same as interior nodes - the same memory allocation) will ensure that all implementations (irrespective of chosen memory management type) do the same number of heap allocations.

Edited Mar 31, 2021 by Piotr Tarsa