@pavlinpolicar.bsky.social
Reposted
hippopedoid.bsky.social
We spent a year writing this review of low-dim embeddings and arguing about things like epistemic roles and best practices :-) 20+ authors are all participants of the Dagstuhl seminar we held last year: www.dagstuhl.de/24122. Led by @alexandr.bsky.social and Cyril de Bodt.

arxiv.org/abs/2508.15929
Reposted
alexandr.bsky.social
Last year I met a bunch of great researchers who work with high-dimensional data at a Dagstuhl seminar. This week we put out a preprint about the history and philosophy of low-dimensional embedding methods, their applications, their challenges, and their possible future arxiv.org/abs/2508.15929
[Image descriptions: (1) The participants of Dagstuhl Seminar 24122 standing on steps outside (from https://www.dagstuhl.de/24122). (2) Multiple types of embeddings (UMAP, t-SNE, Laplacian Eigenmaps, PHATE, PCA, MDS) of Wikipedia text data, labelled with text summaries generated by an LLM; methods like UMAP and t-SNE show cluster structure reflecting shared subject matter, while other methods show more continuous structure. (3) Multiple embedding methods (PCA, Laplacian Eigenmaps, t-SNE, MDS, PHATE, UMAP) of primate brain organoids at different time points; different methods highlight different aspects of development, such as clusters of similar cell types or time courses of cell development. (4) The same embedding methods applied to 1000 Genomes Project genotypes; different methods reflect different aspects of the demographic history of populations.]
pavlinpolicar.bsky.social
One encouraging takeaway from the study is that, at least on this particular task, open-source LLMs prove just as capable as OpenAI's commercial GPT-4o. This means that universities could run their own LLM graders in-house without fear of compromising student privacy. 7/
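To make the in-house option concrete, here is a minimal, hypothetical sketch (my illustration, not the study's actual setup): self-hosted servers such as vLLM or Ollama expose an OpenAI-compatible API, so the same client code can talk to a university-hosted Llama 3 model and student answers never leave campus. The base_url, api_key, and model name below are illustrative placeholders.

```python
# Hypothetical sketch: the standard OpenAI client pointed at a self-hosted,
# OpenAI-compatible server (e.g. vLLM or Ollama). Nothing is sent to OpenAI.
from openai import OpenAI

client = OpenAI(
    base_url="http://llm.example.university:8000/v1",  # assumed in-house endpoint
    api_key="not-needed-locally",                      # local servers ignore the key
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # assumed model name
    messages=[
        {"role": "user", "content": "Grade this student answer against the rubric: ..."}
    ],
)
print(response.choices[0].message.content)
```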
pavlinpolicar.bsky.social
So, are we human TAs obsolete?

Well, not quite.

First, setting up good grading rubrics takes quite a bit of time and effort. Second, LLMs achieved an accuracy of 90%, which leaves room for improvement. Still, newer models may well perform even better! 6/
pavlinpolicar.bsky.social
In terms of feedback, students actually seem to slightly *prefer* feedback written by LLMs over human-written feedback. While there is some nuance to this result, the conclusion is clear: students are just as happy with LLM-generated feedback as with TA-written feedback. 5/
pavlinpolicar.bsky.social
In our setup, LLMs determined whether answers satisfied predefined grading criteria that we, the TAs, painstakingly prepared ahead of time. Here, LLMs achieve roughly 90% accuracy. Small LLMs work well on easier questions but are overly generous on harder, open-ended questions. 4/
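As a rough illustration of this kind of setup (a minimal sketch, not our actual grading pipeline), one could ask a locally hosted model one yes/no question per rubric criterion and score an answer as the fraction of criteria met. The Ollama endpoint, prompt wording, and model name here are assumptions for the example.

```python
# Minimal sketch of criterion-based grading with a locally hosted LLM.
# Assumes an Ollama server on localhost; prompt and model are illustrative.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

PROMPT = """You are grading a student's short answer.
Question: {question}
Student answer: {answer}
Grading criterion: {criterion}
Does the answer satisfy the criterion? Reply with exactly YES or NO."""

def criterion_met(question: str, answer: str, criterion: str,
                  model: str = "llama3") -> bool:
    """Ask the local LLM whether a single rubric criterion is satisfied."""
    payload = json.dumps({
        "model": model,
        "prompt": PROMPT.format(question=question, answer=answer, criterion=criterion),
        "stream": False,  # return one JSON object instead of a token stream
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["response"]
    return reply.strip().upper().startswith("YES")

# A grade is then the fraction of rubric criteria the answer satisfies:
# score = sum(criterion_met(q, a, c) for c in rubric) / len(rubric)
```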
pavlinpolicar.bsky.social
We tested several models, including OpenAI's GPT-4o and open-source Llama 3 models of varying sizes. So, can LLMs grade student assignments?

The short answer is "mostly yes".

There are two aspects to grading student answers: the grade and the feedback. 3/
pavlinpolicar.bsky.social
We wanted to see whether LLMs could grade short text answers as well as (or better than) human TAs. Over the course of the semester, students answered 36 questions of varying difficulty, and their answers were randomly assigned to be graded by a human TA or an LLM. 2/
pavlinpolicar.bsky.social
Can LLMs replace human teaching assistants in grading homework assignments?
This semester, we ran a study in our Bioinformatics course with 100+ enrolled master's students, whose assignments were graded by 6 LLMs. 1/