
Natasha Johnson @natashamarie330.bsky.social · 4h

I’ll be presenting this work in **2 hours** at EMNLP’s Gather Session 3. Come by to chat about fanfiction, literary notions of similarity, long-context modeling, and consent-focused data collection!

Natasha Johnson @natashamarie330.bsky.social · 4h

Digital humanities researchers often care about fine-grained similarity based on narrative elements like plot or tone, which don’t necessarily correlate with surface-level textual features.

Can embedding models capture this? We study this in the context of fanfiction!

Figure showing a similarity comparison between three stories. Story A and story B have the same author, and story A and story C have the same tone. A human might care about which stories are tonally the most similar, but a language model's notion of similarity is strongly informed by surface-level features like small differences in writing style across authors.


Natasha Johnson @natashamarie330.bsky.social · 4h

This was joint work with @abertsch.bsky.social, Maria-Emil Deal, and @strubell.bsky.social
Paper: arxiv.org/abs/2510.20926
Dataset: huggingface.co/datasets/fic...

FicSim: A Dataset for Multi-Faceted Semantic Similarity in Long-Form Fiction

As language models become capable of processing increasingly long and complex texts, there has been growing interest in their application within computational literary studies. However, evaluating the...



Natasha Johnson @natashamarie330.bsky.social · 4h

Even strong embedding models over-index on surface features: for every model tested, similarity scores reflect author or fandom more than semantic aspects like theme or characterization. This holds even when models are explicitly instructed to focus on these aspects!

The performance (Spearman's rank correlation coefficient) of a number of embedding models across fine-grained semantic categories and superficial categories like author name. All models perform far worse on fine-grained categories than superficial categories. Explicit prompting for the category of interest is ineffective.
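
For readers unfamiliar with the metric, here is a minimal sketch of how a Spearman-style evaluation of an embedding model could look. It is not the paper's code: the story IDs, the toy embedding vectors, and the `gold_tone` scores are all made up for illustration, and cosine similarity is an assumed choice of similarity function.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical embeddings for four stories (toy values, not model output).
embeddings = {
    "A": [0.9, 0.1, 0.2],
    "B": [0.8, 0.2, 0.1],
    "C": [0.1, 0.9, 0.3],
    "D": [0.2, 0.8, 0.4],
}

# Hypothetical gold similarity for one category (here "tone"), per story pair.
gold_tone = {
    ("A", "B"): 0.1, ("A", "C"): 0.9, ("A", "D"): 0.3,
    ("B", "C"): 0.2, ("B", "D"): 0.8, ("C", "D"): 0.1,
}

pairs = list(combinations(sorted(embeddings), 2))
model_scores = [cosine(embeddings[a], embeddings[b]) for a, b in pairs]
gold_scores = [gold_tone[pair] for pair in pairs]

rho = spearman(model_scores, gold_scores)
print(f"Spearman rho vs. gold tone similarity: {rho:.3f}")
```

A low or negative value here would mean the model's ranking of story pairs disagrees with the gold ranking for that category, which is the pattern the figure reports for fine-grained categories.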


Natasha Johnson @natashamarie330.bsky.social · 4h

All selected fanfiction has detailed metadata and author-generated tags describing the fanfic content. Informed by fan studies and digital humanities literature, we classify these into 12 categories to construct gold labels for a fine-grained semantic similarity task.

A screenshot of Archive of Our Own's story metadata for one of the stories in FicSim, annotated for different types of similarity. Some fields (like content rating and category) are always assigned to categories like style or relationship dynamic, while other groups of tags are classified individually by annotators.


Natasha Johnson @natashamarie330.bsky.social · 4h

We introduce FicSim, a dataset of 90 recently written long-form fanfics from Archive of Our Own. We *reach out to the authors for permission* to use each work and prioritize continual, informed author consent. Fics range in length from 10K to 400K+ words.

Histogram of story length, ranging from 10 thousand to over 400 thousand words. Most stories are between 10 and 90 thousand words.

