James Tauber
@jtauber.com
6.9K followers 4.5K following 2.7K posts
Using computers to better understand languages, texts, and music. Python, Web, Corpus Linguistics, Data Visualization, Philology, Ancient Greek, Music Theory, Tolkien, Space, Health. Perseus DL, Greek Learner Texts Project & Digital Tolkien Project!
Pinned
jtauber.com
Intro for new followers: I'm a long-time (i.e. old) Python and Web developer. Now mostly apply that to digital humanities and corpus linguistics with focus on historical languages (especially Ancient Greek) and Tolkien. Also education, data visualization, music theory, and a handful of other things.
jtauber.com
Of the 12,450 deduplicated disk images of the right size, 7,092 (57.0%) of them appear to be normal DOS 3.3 disks with a regular VTOC.
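A minimal sketch of what a "regular VTOC" check might look like. The post does not say which test was actually used, so the fields checked here are an assumption based on the commonly documented DOS 3.3 layout (VTOC at track 17, sector 0), and `looks_like_dos33` is a hypothetical helper name.

```python
IMAGE_SIZE = 143_360          # size quoted in the posts
VTOC_OFFSET = 17 * 16 * 256   # track 17, sector 0 in a 16-sector image


def looks_like_dos33(image: bytes) -> bool:
    """Heuristic: does the VTOC sector hold the usual DOS 3.3 values?"""
    if len(image) != IMAGE_SIZE:
        return False
    vtoc = image[VTOC_OFFSET:VTOC_OFFSET + 256]
    return (
        vtoc[0x01] == 17        # track of the first catalog sector
        and vtoc[0x03] == 3     # DOS release number
        and vtoc[0x34] == 35    # tracks per disk
        and vtoc[0x35] == 16    # sectors per track
        and int.from_bytes(vtoc[0x36:0x38], "little") == 256  # bytes per sector
    )
```

Disks with custom loaders or other operating systems would fail a check like this, which is consistent with only part of the corpus passing.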
jtauber.com
The type-token-ratio of natural language can vary considerably by text length (and for a given text window it's typically one of the inputs to register analysis).

@willwhim.com called it, though: once you ignore the nulls, it's pretty Zipfian.
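To make the point about text length concrete, here is a small sketch that computes the type-token ratio over fixed-size windows. The function names are illustrative, and the tokens can be words or, for the disk images, individual byte values.

```python
def type_token_ratio(tokens):
    """Distinct tokens (types) divided by total tokens."""
    return len(set(tokens)) / len(tokens)


def windowed_ttr(tokens, window):
    """TTR for each consecutive, non-overlapping window of a fixed size."""
    return [
        type_token_ratio(tokens[i:i + window])
        for i in range(0, len(tokens) - window + 1, window)
    ]


words = "the cat sat on the mat the end".split()
print(type_token_ratio(words))   # 6 types / 8 tokens
print(windowed_ttr(words, 4))    # one ratio per 4-token window
```

Comparing windows of equal size avoids the length effect, since a longer text tends to repeat more tokens and so has a lower ratio.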
jtauber.com
Good call! I'd say so.
[Image: log-log scatter plot, mostly following a straight line until the tail end]
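One way to put a number on "pretty Zipfian" is to fit a straight line to log frequency against log rank. This is only a sketch and not necessarily how the plot was produced; `zipf_slope` is a hypothetical name, and the caller is assumed to drop the `$00` count beforehand to "ignore the nulls".

```python
import math


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is the classic Zipfian shape. Needs at least
    two non-zero frequencies.
    """
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A single slope hides the behaviour at the tail end, so it complements the plot rather than replacing it.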
jtauber.com
Every time I think it was a specific thing I did, I check and the person just blocked thousands of people at once :-)
jtauber.com
Okay, the LEAST COMMON byte value across the current Apple II disk image corpus is $9B.

The top five from fifth most common to most common:

$80
$FF
$A0
$20

and by a whopping amount (the distribution is NOT Zipfian)

$00
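A ranking like this can be produced with a plain byte histogram. The sketch below assumes the images are supplied as an iterable of `bytes` objects; the function names are illustrative.

```python
from collections import Counter


def byte_counts(chunks):
    """Count every byte value across an iterable of bytes objects."""
    counts = Counter()
    for chunk in chunks:
        counts.update(chunk)  # iterating bytes yields ints 0..255
    return counts


def ranked(counts):
    """All 256 byte values, least common first (ties broken by value)."""
    return sorted(range(256), key=lambda b: (counts[b], b))


counts = byte_counts([b"\x00\x00\x00\x20", b"\x20\xa0"])
order = ranked(counts)
print([f"${b:02X}" for b in order[-3:]])  # three most common, ascending
```

Sorting over `range(256)` rather than the counter's keys keeps values that never occur in the ranking, which matters when looking for the least common byte.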
jtauber.com
Even before I extract files from these disk images, make any distinction between code and data, or do any disassembly, does anyone want to make predictions about the distribution of byte values, repeated n-grams, etc?
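For the repeated n-grams part of the question, a sliding window over the raw bytes is enough to get started. This is a sketch with a hypothetical `byte_ngrams` helper, and it treats the image as one flat byte string with no code/data distinction.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int) -> Counter:
    """Count every overlapping n-byte sequence in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


sample = b"\xa9\x00\x8d\xa9\x00"
for gram, count in byte_ngrams(sample, 2).most_common(1):
    print(gram.hex(), count)
```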
jtauber.com
gotta do SOMETHING while I should be packing :-)
jtauber.com
Of those 13,252 I can eliminate a further 803 as having a duplicate MD5 hash.
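Duplicate elimination by MD5 can be sketched as follows. The `(name, data)` input shape and the `deduplicate` name are assumptions for illustration; only the use of MD5 comes from the post.

```python
import hashlib


def deduplicate(images):
    """Map each MD5 hex digest to the first image name that produced it.

    images: iterable of (name, data) pairs, where data is bytes.
    """
    seen = {}
    for name, data in images:
        digest = hashlib.md5(data).hexdigest()
        seen.setdefault(digest, name)  # keep only the first occurrence
    return seen
```

The number of duplicates eliminated is then the number of input images minus `len(seen)`.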
jtauber.com
Of the 13,332 files I identified for download, 13,312 successfully downloaded and 13,252 are the expected size: 143,360 bytes.
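The expected size follows from the standard 5.25-inch disk geometry of 35 tracks, 16 sectors per track, and 256 bytes per sector. A minimal size filter, with hypothetical names, might look like this:

```python
TRACKS = 35
SECTORS_PER_TRACK = 16
BYTES_PER_SECTOR = 256
EXPECTED_SIZE = TRACKS * SECTORS_PER_TRACK * BYTES_PER_SECTOR  # 143,360


def has_expected_size(data: bytes) -> bool:
    """True if the image is exactly one 16-sector, 35-track disk."""
    return len(data) == EXPECTED_SIZE
```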
jtauber.com
Seeing all the names of the crackers in the filenames reminds me that, even though I never cracked any disks myself and certainly never did any network cracking, I decided at around twelve that my cracker name was The Acorn.
jtauber.com
here's a full list if anyone is interested: github.com/jtauber/CARC...
jtauber.com
currently half-way through the download :-)
jtauber.com
I've identified 13,322 Apple II disk images to download.
jtauber.com
Thanks! Had a lot of fun with the visuals
jtauber.com
the 6809 was my first love
jtauber.com
It's very DH-sounding but I'm currently thinking about extracting and cleaning up machine code files from the Asimov Apple II disk image archive.
jtauber.com
Z80 is definitely the second processor I'd do
jtauber.com
I have to say I'm darn proud of that acronym :-)
jtauber.com
Okay, I've started a repo (and even came up with a Tolkien-inspired acronym)! If you're interested, star / watch the repo. I'll start some discussions there too.

github.com/jtauber/CARC
GitHub - jtauber/CARC: Corpus Analysis of Retro Code
jtauber.com
Okay. I'm going to start with 6502 machine code.

Time to build a corpus!
jtauber.com
I'm dead serious about pursuing this but I did chuckle at it giving a whole new meaning to "register analysis".
jtauber.com
I've done a lot of retro computing in my time, from implementing CPUs like the 6502 and Z80 in Python to analyzing the source code of operating systems and games like Ultima IV, but something just occurred to me that I'm now very interested in exploring:

Applying corpus linguistics to old code.