Brendan G Bohannon
bgbtech.bsky.social
Hobbyist programmer (C, Verilog, ASM, etc), CPU and ISA design, have my own C compiler, 3D and GL, ..., and sometimes electronics.
ASD/aspie, single (ace/demi), Gen Y. Currently lives in OK.
In this case, the prefix is basically a 64-bit instruction encoding whose sole purpose is to glue an additional 52 bits onto the immediate of the following instruction. Had to get a little tricky here as bits are tight in the RV encoding space... Mostly follows a similar pattern to the J21I prefixes.
December 24, 2025 at 8:39 PM
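A rough sketch of the immediate-gluing idea from the post above, assuming the prefix supplies the high 52 bits and the following instruction supplies an ordinary 12-bit immediate as the low bits. The post does not give the actual bit placement, so the layout and the name glue_imm64 are hypothetical:

```c
#include <stdint.h>

/* Hypothetical helper: combine a 52-bit prefix payload with the 12-bit
 * immediate of the following instruction into one 64-bit immediate.
 * Assumption: prefix bits form the high part, imm12 the low part. */
static uint64_t glue_imm64(uint64_t prefix52, uint32_t imm12)
{
    prefix52 &= (UINT64_C(1) << 52) - 1;          /* keep only 52 bits */
    return (prefix52 << 12) | (uint64_t)(imm12 & 0xFFFu);
}
```

With 52 + 12 = 64 bits, the pair can carry a full Imm64 or Abs64 without needing a scratch register to build the value.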
While arguably not as effective for normal code generation (Imm64 rare, Abs64 not usually desirable), they allow branching and accessing memory from unrelated parts of the address space without stomping any registers (can be very useful for patches and thunks...). Still experimental for now...
December 24, 2025 at 8:35 PM
It depends, but even with a GPU, I often prefer NEAREST_MIPMAP_LINEAR or similar over conventional trilinear filtering. I don't mind aliasing, and sometimes even prefer the look of dithering. Theoretically the opposite of what someone "should" prefer...
November 20, 2025 at 7:00 AM
Yeah; HL1 had more freedom to just sorta wander around in Black Mesa and gradually advance the plot. HL2 was basically just going in a straight line the whole time, bigger spaces but mostly just used for vehicle sections, ... Much more linear so had a different feel.
November 20, 2025 at 6:53 AM
OK, by tweaking some parameters in my existing PNG encoder's Deflate stage, was able to get it to 30ms. Switched from checking the last 16 matches over 4K to the last 4 matches over the last 16K (a fairly significant speedup). Still slow, but a bit better at least, can live with 30ms...
November 16, 2025 at 6:05 PM
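The two knobs mentioned in the post above (how many earlier candidates to check, and how far back to look) can be sketched as a hash-chain match finder. This is a generic LZ77-style illustration, not the encoder from the post; the Matcher type, the hash function, and all names are assumptions:

```c
#include <stddef.h>
#include <stdint.h>

#define HASH_BITS 12
#define HASH_SIZE (1u << HASH_BITS)
#define MIN_MATCH 3
#define MAX_MATCH 258

typedef struct {
    int32_t  head[HASH_SIZE]; /* most recent position per hash, -1 if none */
    int32_t *prev;            /* prev[pos] = older position with same hash */
    size_t   window;          /* max match distance, e.g. 16384 */
    int      max_chain;       /* candidates to check, e.g. 4 */
} Matcher;

static uint32_t hash3(const uint8_t *p)
{
    uint32_t v = p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
    return (v * 2654435761u) >> (32 - HASH_BITS);
}

/* prev must have room for one entry per input byte. */
static void matcher_init(Matcher *m, int32_t *prev, size_t window, int max_chain)
{
    for (size_t i = 0; i < HASH_SIZE; i++) m->head[i] = -1;
    m->prev = prev;
    m->window = window;
    m->max_chain = max_chain;
}

/* Record position pos; the caller ensures at least 3 bytes remain. */
static void matcher_insert(Matcher *m, const uint8_t *buf, size_t pos)
{
    uint32_t h = hash3(buf + pos);
    m->prev[pos] = m->head[h];
    m->head[h] = (int32_t)pos;
}

/* Longest match for buf[pos..] among at most max_chain earlier positions
 * within the window. Call before inserting pos itself.
 * Returns the length (0 if shorter than MIN_MATCH) and sets *dist. */
static size_t matcher_find(const Matcher *m, const uint8_t *buf, size_t len,
                           size_t pos, size_t *dist)
{
    size_t best = 0, limit;
    int32_t cand;
    if (pos + MIN_MATCH > len) return 0;
    limit = len - pos;
    if (limit > MAX_MATCH) limit = MAX_MATCH;
    cand = m->head[hash3(buf + pos)];
    for (int n = 0; n < m->max_chain && cand >= 0; n++, cand = m->prev[cand]) {
        size_t c = (size_t)cand, d = pos - c, l = 0;
        if (d > m->window) break;      /* chain entries only get older */
        while (l < limit && buf[c + l] == buf[pos + l]) l++;
        if (l > best) { best = l; *dist = d; }
    }
    return best >= MIN_MATCH ? best : 0;
}
```

Under this sketch, the change described would be going from matcher_init(&m, prev, 4096, 16) to matcher_init(&m, prev, 16384, 4): fewer byte-compare loops per position, in exchange for a longer reach.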
Granted, it spends a lot of time:
Choosing which filter to use (per-scanline);
(usually) choosing Paeth (also slow);
Deflating a big chunk of data (harder to make fast);
Could maybe look into speed tuning it some...
So: ~ 200ms for a 720x480 image (vs 9ms for said BMP).
November 16, 2025 at 5:11 PM
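For reference, a sketch of the Paeth step listed above. The predictor follows the PNG specification; the cost measure (sum of absolute signed residuals, a common heuristic for picking a filter per scanline) and the function names are assumptions, not necessarily what this encoder does:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Paeth predictor as defined by the PNG specification:
 * a = left, b = above, c = upper-left. */
static uint8_t paeth(uint8_t a, uint8_t b, uint8_t c)
{
    int p  = a + b - c;
    int pa = abs(p - a), pb = abs(p - b), pc = abs(p - c);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

/* Apply the Paeth filter to one scanline of n bytes (bpp = bytes per
 * pixel; up may be NULL for the first row). Returns a cost that a
 * per-scanline filter chooser could compare across filter types. */
static unsigned long paeth_filter_row(const uint8_t *cur, const uint8_t *up,
                                      uint8_t *out, size_t n, size_t bpp)
{
    unsigned long cost = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t a = i >= bpp ? cur[i - bpp] : 0;
        uint8_t b = up ? up[i] : 0;
        uint8_t c = (up && i >= bpp) ? up[i - bpp] : 0;
        uint8_t r = (uint8_t)(cur[i] - paeth(a, b, c));
        out[i] = r;
        cost += r < 128 ? r : 256u - r;   /* |residual| as signed byte */
    }
    return cost;
}
```

Trying every filter this way means several passes over each scanline before Deflate even starts, which fits the note above about where the time goes.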
They are not usually dithered monochrome though, something else is going on...
November 16, 2025 at 9:26 AM
A lot of language features are missing (such as "switch()", etc), and filling in more holes will increase line count. In this case, it was partly an experiment to see if I could write a small/simple interpreter for such a language. It is a mystery if more could be done with less code...
October 30, 2025 at 7:16 AM
Hypothetically, one could use FP8 as the input, and then an internal intermediate 12b format, with log-scale adders. Should be possible to parallelize some of the logic provided one knows largest value. Output cost then mostly N-way 12b integer add. Still wouldn't be very accurate though...
October 27, 2025 at 4:36 PM
Yeah, the lack of cheap FP add is the problem here. There is a way to poorly approximate FP8 add by treating it as a log-scale value, which gets a little fiddly for subtract. But, not really accurate enough (still generally need FP16 for the accumulator IME...).
October 27, 2025 at 4:02 PM
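One possible reading of the log-scale trick in the post above, as a sketch only. It assumes an E4M3-style layout (4 exponent bits, 3 mantissa bits), positive inputs, and a made-up correction term; the function name is hypothetical and subtraction, subnormals, and special values are ignored:

```c
#include <stdint.h>

/* Approximate a + b for two positive FP8 values by treating the 7
 * magnitude bits as a fixed-point log2 value with 3 fractional bits.
 * Assumptions: E4M3-style layout, sign bit clear, 0x00 means zero.
 * Results clamp to 0x7F, treated here as the largest code (real FP8
 * variants differ on what that bit pattern means). */
static uint8_t fp8_add_log_approx(uint8_t a, uint8_t b)
{
    uint8_t hi = a > b ? a : b;
    uint8_t lo = a > b ? b : a;
    unsigned d, i, f, corr, r;
    if (lo == 0) return hi;                 /* x + 0 = x */
    d = (unsigned)(hi - lo);                /* log2 ratio in 1/8 steps */
    i = d >> 3;                             /* integer part of d */
    f = d & 7;                              /* fractional part of d */
    /* corr ~= 8 * log2(1 + 2^-d), using log2(1 + x) ~= x and a
     * piecewise-linear 2^-d, i.e. (16 - f) / 2^(i + 1). */
    corr = i < 5 ? (16u - f) >> (i + 1) : 0;
    r = hi + corr;
    return (uint8_t)(r > 0x7F ? 0x7F : r);
}
```

With an exponent bias of 7 (also an assumption), 1.0 is 0x38 and 2.0 is 0x40, so 1.0 + 1.0 gives 0x40 as expected, while 3.0 + 1.0 (0x44 + 0x38) gives 0x47, which decodes as 3.75 rather than 4.0. Errors like that one are why a wider accumulator still looks necessary.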
Tyrell gets replicants to hold the potato slices in the deep fryer by hand. They live for this stuff, and it adds to the flavor...
October 4, 2025 at 10:17 AM
The new block format has two 8-bit E4.M4 values, 16 sign bits, and 4x4x2 interpolation bits. May be accessed as either 16 scalar values, or four 4-element vectors. Instruction will decode into Binary16 format. The current instructions use 64-bit encodings only for now...
July 29, 2025 at 6:28 PM
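The fields in the post above add up to exactly 64 bits (2x8 + 16 + 4x4x2 = 64). A sketch of unpacking such a block follows; the field order within the 64-bit word, the Block16 type, and the function name are all assumptions, and the conversion to Binary16 is left out because the exponent bias and interpolation weights are not stated:

```c
#include <stdint.h>

typedef struct {
    uint8_t end_a, end_b;   /* the two 8-bit E4.M4 values */
    uint8_t sign[16];       /* per-element sign bit, 1 = negative */
    uint8_t sel[16];        /* per-element 2-bit interpolation value */
} Block16;

/* Assumed layout (hypothetical): bits 0..7 end_a, bits 8..15 end_b,
 * bits 16..31 sign bits, bits 32..63 sixteen 2-bit selectors. */
static Block16 block_unpack(uint64_t bits)
{
    Block16 b;
    b.end_a = (uint8_t)(bits & 0xFF);
    b.end_b = (uint8_t)((bits >> 8) & 0xFF);
    for (int i = 0; i < 16; i++) {
        b.sign[i] = (uint8_t)((bits >> (16 + i)) & 1);
        b.sel[i]  = (uint8_t)((bits >> (32 + 2 * i)) & 3);
    }
    return b;
}
```

Element i would be used directly for scalar access, and elements 4*v to 4*v+3 for vector v.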
Well, I spoke too soon earlier... More optimization fiddling has gotten the performance for an "/O2" build now up to around 275 MIPs... Didn't really expect to be able to get it this much faster... Though, a few questionable/unorthodox things were done to pull this off.
June 6, 2025 at 12:55 AM
The interpreter is basically fast enough to keep Doom pegged at its framerate limiter (~ 34 fps typical). Depending, may be worth trying Quake or similar as another test case.
Faster (than JX2VM) mostly as it is userland only, and also isn't trying to be cycle-accurate...
June 5, 2025 at 7:21 PM
For now, I have Doom running in it as a test.
Performance currently seems to be: ~ 70 MIPs (MSVC, /Zi); ~ 140 MIPs (/Zi /O2) or ~ 170 MIPs (/O2), (MIPs = million emulated instructions/second). Plain C interpreter for now, making it much faster would likely require a JIT or similar (tradeoffs).
June 5, 2025 at 7:18 PM
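To make the MIPs figure concrete, here is a toy switch-dispatch loop that counts executed instructions, plus the rate calculation. The three-opcode instruction set is invented for illustration and has nothing to do with the actual guest ISA:

```c
#include <stdint.h>

enum { OP_ADDI, OP_DEC_JNZ, OP_HALT };   /* toy opcodes, hypothetical */

typedef struct { uint8_t op; uint8_t reg; int32_t imm; } Insn;

/* Plain C interpreter loop: fetch, count, dispatch via switch.
 * Returns the number of instructions executed, including OP_HALT. */
static uint64_t run(const Insn *code, int64_t *regs)
{
    uint64_t count = 0;
    uint32_t pc = 0;
    for (;;) {
        const Insn *in = &code[pc++];
        count++;
        switch (in->op) {
        case OP_ADDI:
            regs[in->reg] += in->imm;
            break;
        case OP_DEC_JNZ:                 /* decrement, jump if nonzero */
            if (--regs[in->reg] != 0) pc = (uint32_t)in->imm;
            break;
        default:                         /* OP_HALT or unknown opcode */
            return count;
        }
    }
}

/* MIPs = million emulated instructions per second. */
static double mips(uint64_t insns, double seconds)
{
    return (double)insns / seconds / 1e6;
}
```

Timing a call to run() with clock() from <time.h> and passing the elapsed seconds to mips() gives the kind of number quoted above. Every instruction pays for the fetch and the switch, which is the overhead a JIT would remove.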
In effect, the current idea is that both guest-to-host and host-to-guest interactions will be handled using COM-object-style interfaces. The VM will use its own address space, though, so a few extra hassles exist; shared-memory areas are the most likely option here.
June 1, 2025 at 5:20 AM
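For anyone unfamiliar with the pattern, a COM-style interface in plain C is an object whose first member points to a table of function pointers. A minimal sketch, with a made-up IDemo interface and a 32-bit interface ID standing in for a real GUID; none of this is the actual TKGDI API:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct IDemo IDemo;

typedef struct {
    int      (*QueryInterface)(IDemo *self, uint32_t iid, void **out);
    uint32_t (*AddRef)(IDemo *self);
    uint32_t (*Release)(IDemo *self);
    int      (*DrawPixel)(IDemo *self, int x, int y, uint32_t rgb);
} IDemoVtbl;

struct IDemo {
    const IDemoVtbl *vtbl;   /* first member, as in COM */
    uint32_t refs;
    uint32_t last_rgb;       /* stand-in for real drawing state */
};

#define IID_DEMO 0x44454D4Fu /* hypothetical interface id */

static int demo_query(IDemo *self, uint32_t iid, void **out)
{
    if (iid != IID_DEMO) { *out = NULL; return -1; }
    self->vtbl->AddRef(self);
    *out = self;
    return 0;
}

static uint32_t demo_addref(IDemo *self) { return ++self->refs; }

/* A heap-allocated object would free itself when this reaches 0. */
static uint32_t demo_release(IDemo *self) { return --self->refs; }

static int demo_draw(IDemo *self, int x, int y, uint32_t rgb)
{
    (void)x; (void)y;
    self->last_rgb = rgb;
    return 0;
}

static const IDemoVtbl demo_vtbl = {
    demo_query, demo_addref, demo_release, demo_draw
};

static void demo_init(IDemo *obj)
{
    obj->vtbl = &demo_vtbl;
    obj->refs = 1;
    obj->last_rgb = 0;
}
```

A caller only ever goes through obj->vtbl, so the implementation can live on either side of a boundary. With separate guest and host address spaces, the function pointers cannot be called directly across that boundary, so each call would need some thunk or marshalling layer, and bulk data would go through a shared buffer.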
So, at this point, need to decide on whether to implement this API and allow (as a hack) Doom to be run inside the 3D engine (say, to see how fast this VM goes; it may also allow further debugging of the interpreter). Either way, will need to implement support for COM-like interfaces from the VM.
June 1, 2025 at 5:16 AM
This interpreter is userland only (I am not calling it an emulator because of this; the VM and host program serve the role of the OS). Doom gets far enough along that it requests a TKGDI context (COM-like API; does sound and graphics stuff). But, maybe outside the scope of my 3D engine...
June 1, 2025 at 5:09 AM
Though, to be fair, even if one does go down and comment, much of the time (at least in my experience) YT just immediately makes it disappear into the ether (regardless of topic). Almost not worth trying to comment on anything on YT anymore...
May 22, 2025 at 1:43 AM