lexichron is for tracking conceptual shifts across years and decades. It efficiently prepares and filters extremely large corpora, trains models on them, and analyzes the results.
Although it can be tuned for personal computers, lexichron is designed for parallelization on HPC clusters and cloud platforms.
lexichron implements two data-prep pipelines:
- Google Ngrams: download and filter 1–5 grams in 8 languages (see the filtering sketch below)
- Mark Davies' corpora: process datasets from English-Corpora.org that carry year and genre metadata (e.g., COHA, COCA); this requires a license and access to the corpus files
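To make the Ngrams step concrete, here is a minimal sketch of the kind of filtering that pipeline performs, written against the tab-separated layout of the 2012 (v2) shards (ngram, year, match_count, volume_count). The function name, thresholds, and whitelist are illustrative, not lexichron's actual API.

```python
# Illustrative only: a minimal filter over one Google Ngrams (v2) shard.
# min_year, min_count, and the whitelist are placeholders.
import gzip

def filter_ngram_shard(path, whitelist, min_year=1900, min_count=40):
    """Yield (ngram, year, match_count) rows that pass all filters."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            ngram, year, match_count, _volumes = line.rstrip("\n").split("\t")
            if int(year) < min_year or int(match_count) < min_count:
                continue
            # Keep the ngram only if every token is whitelisted.
            if all(tok.lower() in whitelist for tok in ngram.split()):
                yield ngram, int(year), int(match_count)
```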
Both data-prep pipelines feed into a single word2vec training module (gensim's implementation) that trains, normalizes, and aligns yearly models to capture semantic change.
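Below is a minimal sketch of that train/normalize/align loop, assuming gensim's Word2Vec for training and orthogonal Procrustes alignment (the standard approach of Hamilton et al., 2016); the function names and hyperparameters are illustrative, and lexichron's actual module may differ.

```python
# Sketch of train -> normalize -> align; names and settings are illustrative.
import numpy as np
from gensim.models import Word2Vec
from scipy.linalg import orthogonal_procrustes

def train_year(sentences):
    # sg=1 selects skip-gram with negative sampling (SGNS); sg=0 is CBOW.
    return Word2Vec(sentences, vector_size=300, window=5,
                    sg=1, min_count=10, workers=8)

def normalize(model):
    """Length-normalize vectors so cosine similarity is a plain dot product."""
    model.wv.vectors /= np.linalg.norm(model.wv.vectors, axis=1, keepdims=True)

def align_to(base, other):
    """Rotate `other`'s vectors into `base`'s space over the shared vocab."""
    shared = [w for w in other.wv.index_to_key if w in base.wv]
    a = np.stack([other.wv[w] for w in shared])
    b = np.stack([base.wv[w] for w in shared])
    rotation, _ = orthogonal_procrustes(a, b)  # orthogonal R minimizing ||aR - b||
    other.wv.vectors = other.wv.vectors @ rotation
```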
Key features:
- Tunable preprocessing, including case normalization, lemmatization, stop-word removal, and spell-checking (sketched after this list)
- Vocabulary whitelisting for efficient filtering
- Bigram preservation, retaining semantically interesting word pairs
- RocksDB databases for fast queries
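As a rough illustration of how the preprocessing, whitelisting, and bigram options interact (the stop words, whitelist, and bigram list below are placeholders, not lexichron's built-in resources):

```python
# Toy preprocessing pass; all word lists here are placeholders.
STOP_WORDS = {"the", "a", "an", "of", "and"}
KEPT_BIGRAMS = {("climate", "change"), ("civil", "war")}

def preprocess(tokens, whitelist, lowercase=True):
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if pair in KEPT_BIGRAMS:
            out.append("_".join(pair))  # preserve the bigram as one token
            i += 2
            continue
        tok = tokens[i].lower() if lowercase else tokens[i]
        if tok not in STOP_WORDS and tok in whitelist:
            out.append(tok)
        i += 1
    return out
```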
Purpose-built for HPC clusters and cloud computing. Pipelines that take weeks on a laptop run in minutes or hours on a cluster. (Fewer cores? It'll work, just slower.)
lexichron truly shines with 30+ CPUs, 80+ GB of RAM, and fast NVMe SSD storage.
Training and evaluation pipelines:
- Hyperparameter tuning: architecture (SGNS vs. CBOW), vector size, context window, weighting strategy, etc.
- Model evaluation on similarity and analogy benchmarks (see the sketch after this list)
- Visualizations comparing model quality
- Regressions quantifying each hyperparameter's impact
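For the benchmark step, gensim ships evaluation helpers and sample datasets that can be called directly on a trained model; the bare calls look like this (how lexichron wraps them is not shown, and `model` is assumed to be an already-trained Word2Vec):

```python
# Bare-bones benchmark calls; wordsim353.tsv and questions-words.txt
# ship with gensim's test data.
from gensim.test.utils import datapath

pearson, spearman, oov_ratio = model.wv.evaluate_word_pairs(
    datapath("wordsim353.tsv"))                  # similarity benchmark
analogy_score, sections = model.wv.evaluate_word_analogies(
    datapath("questions-words.txt"))             # analogy benchmark
print(f"WordSim-353 Spearman: {spearman[0]:.3f}, "
      f"analogy accuracy: {analogy_score:.3f}")
```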
Analysis resources:
- Easily compute cosine similarities and plot temporal trends
- Track and visualize semantic drift for individual words (drift and WEAT helpers are sketched after this list)
- Compute WEATs (Word Embedding Association Tests)
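As a hypothetical sketch of the drift and WEAT computations over aligned yearly models (the function names and data structures are illustrative, not lexichron's API):

```python
# Hypothetical helpers over aligned yearly gensim models.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def drift_series(models_by_year, word, base_year):
    """Year -> cosine(word@year, word@base_year); lower means more drift."""
    base = models_by_year[base_year].wv[word]
    return {year: cosine(m.wv[word], base)
            for year, m in sorted(models_by_year.items()) if word in m.wv}

def weat(wv, targets_x, targets_y, attrs_a, attrs_b):
    """WEAT test statistic (Caliskan et al., 2017)."""
    def assoc(w):  # s(w, A, B): mean cosine with A minus mean cosine with B
        return (np.mean([cosine(wv[w], wv[a]) for a in attrs_a])
                - np.mean([cosine(wv[w], wv[b]) for b in attrs_b]))
    return sum(assoc(x) for x in targets_x) - sum(assoc(y) for y in targets_y)
```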
lexichron is a work-in-progress but stable for many use cases. Check out the README and the included Jupyter notebooks for more info.
I welcome your feedback!