Christopher Akiki
@cakiki.bsky.social
1.2K followers 69 following 31 posts
research scientist at ScaDS.AI Leipzig in nlp, ir, and ml. @hf.co fellow. @lichess.org team member. @kaggle.com datasets expert.
Reposted by Christopher Akiki
lichess.org
We're cooking.. 👀
Photo of three hard disk drives.
cakiki.bsky.social
526.9 million player deaths in 24.7 million levels of Super Mario Maker 2. Data by @tgr.bsky.social
very colorful scatterplot of player deaths in mario maker levels. "inferno" colormap. some level elements highlighted.
cakiki.bsky.social
Really cool new embeddings exploration tool by @domoritz.de and colleagues from Apple. Can't wait to build with this. Also includes a streamlit component and a Jupyter widget.
screenshot from the embeddings atlas repo example density plot exploring a wine dataset.
cakiki.bsky.social
Would you happen to have your university lectures on OT anywhere online?
cakiki.bsky.social
We are grateful for your sacrifice 🫡
cakiki.bsky.social
Licensing is weird though, they say it's GPL but also include this: "To use the compiled binaries, you must own the game".
cakiki.bsky.social
Woah! EA just open sourced "Command and Conquer: Red Alert" and a bunch of other CnC games! github.com/electronicar...
box art from 1996 version of CNC RED ALERT
cakiki.bsky.social
This is also addressed in the appendix of @alisawuffles.bsky.social and colleagues' paper on BPE mixture inference. I think it might have been discovered by @soldaini.net if I'm not mistaken.

arxiv.org/abs/2407.16607
We observe that the merge list of LLAMA, LLAMA 3, GEMMA, and MISTRAL contain clusters of
redundant merge rules. For instance, in the LLAMA 3 merge list, we see the sequence of merges
_ the, _t he, and _th e, as well as _ and, _a nd, and _an d. Because the merge path for every
token is unique, it is impossible for more than one of these merges to ever be used, and we empirically
verify this by applying the tokenizer to a large amount of text.
We find that this is an artifact of the conversion from sentencepiece to Huggingface tokenizers
format. To construct the merge list, the conversion algorithm naively combines every pair of tokens
in the vocabulary, and then sorts them by token ID, which represents order of creation. While this
is functionally correct, because the redundant merges are not products of the BPE algorithm (i.e.,
they do not actually represent the most-likely next-merge), we need to remove them to apply our
algorithm. To do this, we do some simple pre-processing: for every cluster of redundant merges, we
record the path of merges that achieves each merge; the earliest path is the one that would be taken,
so we keep that merge and remove the rest.
As an aside, this means that a tokenizer’s merge list can be completely reconstructed from its
vocabulary list ordered by token creation. Given only the resulting token at each time step, we can
derive the corresponding merge.
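The aside in the excerpt above can be illustrated with a minimal sketch. It is not the paper's code: the function names are hypothetical, base symbols are assumed to be single characters (real tokenizers typically work on bytes), and it assumes that tokenizing each new token in isolation with the merges known so far yields exactly the two pieces that were merged to create it.

```python
def apply_merges(text, merges):
    """Tokenize `text` by repeatedly applying the earliest-ranked applicable merge."""
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(text)  # assumption: base symbols are single characters
    while len(symbols) > 1:
        candidates = [
            (rank[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in rank
        ]
        if not candidates:
            break
        _, i = min(candidates)  # lowest rank first, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


def reconstruct_merges(vocab_in_creation_order):
    """Derive one merge per multi-character token, in creation order."""
    merges = []
    for token in vocab_in_creation_order:
        if len(token) == 1:
            continue  # base symbol, not the product of a merge
        parts = apply_merges(token, merges)
        if len(parts) != 2:
            raise ValueError(f"cannot derive a unique merge for {token!r}: {parts}")
        merges.append((parts[0], parts[1]))
    return merges


# Toy vocabulary: base characters first, then tokens in order of creation.
toy_vocab = ["t", "h", "e", "a", "n", "d", "th", "the", "an", "and"]
print(reconstruct_merges(toy_vocab))
```

Each token is split using only the merges that existed before it, so redundant candidates such as `t` + `he` never appear in the reconstructed list.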
cakiki.bsky.social
Shouldn't "l" and "o" both still be part of the vocab along with "lo" after the merge? Vocab size should grow, not shrink.
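A small sketch of one BPE training step makes the point concrete. The toy word counts and function names are made up for illustration, and this is a simplified textbook-style step rather than any particular library's implementation.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return max(counts, key=counts.get)


def merge_step(words, vocab):
    """Apply one merge: rewrite the corpus and append the new token to the vocab."""
    pair = most_frequent_pair(words)
    new_token = pair[0] + pair[1]
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(new_token)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    # The constituents stay in the vocabulary; only one entry is added.
    return merged, vocab + [new_token], pair


# Hypothetical toy corpus: word (as a tuple of symbols) -> frequency.
words = {("l", "o", "w"): 5, ("l", "o", "t"): 3, ("w", "o", "n"): 1}
vocab = sorted({s for word in words for s in word})

new_words, new_vocab, pair = merge_step(words, vocab)
print(pair, len(vocab), len(new_vocab))
```

After the step, `l`, `o`, and `lo` are all in the vocabulary, and its size has grown by exactly one.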
Reposted by Christopher Akiki
lichess.org
Lichess is now on @kaggle.com!

Use our puzzles, openings, and engine evaluation datasets directly in your kaggle notebooks: https://www.kaggle.com/organizations/lichess ♟️
Screenshot of Lichess organization overview page on Kaggle
cakiki.bsky.social
The folks at Foursquare released a @hf.co dataset of 104.5 million places of interest, and here are all of them plotted using datashader.
world map with points of interest
cakiki.bsky.social
I'm currently working on an interactive version which will hopefully answer some of these questions! An initial cluster analysis didn't align well with the puzzle labels that are included in the dataset.
cakiki.bsky.social
It's part of an ongoing paper project which I hope to open source very soon!
cakiki.bsky.social
I recently used the @lichess.org puzzles dataset to experiment with chess position embeddings and visualize 4.5M starting positions. (hf.co/datasets/Lic...)
Scatterplot with 4.5M data points. Inferno colormap. Looks somewhat organic with neat clusters.
Reposted by Christopher Akiki
lichess.org
The Lichess database of games, puzzles, and engine evaluations is now on @hf.co - https://huggingface.co/Lichess. Billions of chess data points to download, query, and stream, and we're excited to see what you'll build with it! ♟️ 🤗
cakiki.bsky.social
It's also very similar to Julia's Pluto.jl which is itself a great tool.
cakiki.bsky.social
Would love to hear more about your setup!
cakiki.bsky.social
Early experiment visualizing Cohere For AI's newly released Aya dataset. Multilingual corpora are always so fun to play with.
Scatterplot of the Aya dataset. 250K points in a floral / broccoli like pattern. Inferno color map.
cakiki.bsky.social
I used the datashader library to plot 100M points that follow the pictured formula. The shading shows how "busy" with points a given location is.
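The idea of shading by how "busy" a location is can be sketched without datashader by counting points per pixel. The formula used in the post is not reproduced in the text, so this sketch assumes the standard Clifford attractor map with arbitrary parameter values; the function names and plain-Python binning are stand-ins for the aggregation datashader does far more efficiently.

```python
import math
from collections import Counter


def clifford(n, a=-1.4, b=1.6, c=1.0, d=0.7, x=0.0, y=0.0):
    """Yield n points of the standard Clifford map (assumed, not the post's formula)."""
    for _ in range(n):
        x, y = (
            math.sin(a * y) + c * math.cos(a * x),
            math.sin(b * x) + d * math.cos(b * y),
        )
        yield x, y


def density_grid(points, width, height, x_range, y_range):
    """Count how many points land in each (row, col) cell of a width x height grid."""
    (x0, x1), (y0, y1) = x_range, y_range
    counts = Counter()
    for x, y in points:
        col = math.floor((x - x0) / (x1 - x0) * width)
        row = math.floor((y - y0) / (y1 - y0) * height)
        if 0 <= col < width and 0 <= row < height:
            counts[(row, col)] += 1
    return counts


# |x| <= 1 + |c| and |y| <= 1 + |d| bound the attractor for these parameters.
grid = density_grid(clifford(10_000), 200, 200, (-2.0, 2.0), (-1.7, 1.7))
busiest_cell, hits = grid.most_common(1)[0]
print(len(grid), busiest_cell, hits)
```

Mapping each cell's count to a color (often after a log or histogram-equalizing transform) gives the kind of density shading described in the post.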
cakiki.bsky.social
Clifford-inspired strange attractor.
Black and white image of a strange attractor.