I assume you are specifically referring to the first line “202x…”? I merely wanted to say that the focus in the early 2020s was more on pre-training than on anything else back then. (I think the term LLM wasn’t coined until the 175B GPT-3 model came out.)
Re your LLM idea, I could see it working as a benchmark for agentic LLMs, though, to see if they can extract the correct architecture info from the code bases.
Or in other words, I don't think they trained the model in just a month, given all the ablation studies in that paper.
E.g., looking at arXiv, the Mixtral report came out on 8 Jan 2024 (arxiv.org/abs/2401.04088), and DeepSeekMoE around the same time, on 11 Jan 2024 (arxiv.org/abs/2401.06066).