
I'd say try the nanoGPT speedrun. It's much easier to train, and it gives you a better comparison against optimized systems.

https://github.com/KellerJordan/modded-nanogpt
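
To make concrete what that buys you: the speedrun format fixes the data, step budget, and optimizer, so only the architecture varies between runs. Here's a toy sketch of that kind of controlled head-to-head (not the speedrun code; the sizes, steps, and random "tokens" are made up for illustration):

    # Toy sketch of a controlled head-to-head: same data, same step budget, same
    # optimizer and seed, only the architecture changes. Not the speedrun code;
    # all sizes and the random "tokens" below are made up.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    VOCAB, D, SEQ, BATCH, STEPS = 256, 128, 64, 8, 200

    class TinyLM(nn.Module):
        # Minimal causal LM; n_layers is the only knob varied between runs.
        def __init__(self, n_layers):
            super().__init__()
            self.emb = nn.Embedding(VOCAB, D)
            layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.head = nn.Linear(D, VOCAB)

        def forward(self, x):
            T = x.size(1)
            causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
            return self.head(self.blocks(self.emb(x), mask=causal))

    def run(n_layers, data):
        torch.manual_seed(0)                      # identical init for both runs
        model = TinyLM(n_layers)
        opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
        for step in range(STEPS):
            x = data[step]
            loss = F.cross_entropy(model(x[:, :-1]).reshape(-1, VOCAB),
                                   x[:, 1:].reshape(-1))
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                     # held-out batch as a stand-in val set
            v = data[-1]
            return F.cross_entropy(model(v[:, :-1]).reshape(-1, VOCAB),
                                   v[:, 1:].reshape(-1)).item()

    data = torch.randint(0, VOCAB, (STEPS + 1, BATCH, SEQ))  # stand-in for real tokens
    for name, layers in [("baseline", 2), ("variant", 4)]:
        print(name, "val loss:", round(run(layers, data), 3))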



The linked paper tested this new transformer against nanoGPT:

https://www.techrxiv.org/users/685780/articles/1375955-topol...


Thanks for linking.

Yes, the paper compares the new architecture (which is also a fork of my nanoGPT implementation) with Karpathy's nanoGPT. There are also links to the code and the benchmark used.


Note I didn't say Karpathy's nanoGPT; I said to use the speedrun.

Transformers are universal function approximators. When well-tuned, they often start to approximate other innovations. Not always, thank god, but often enough that you have to be careful.
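
The practical upshot: when you compare a baseline against a "new" block, give both the same tuning budget, or the gain you measure may just be an under-tuned baseline. A minimal sketch of that discipline, with placeholder models and a tiny synthetic task standing in for real training:

    # Placeholder models and a tiny synthetic task stand in for the real thing;
    # the point is the shape of the harness: one search grid, applied identically.
    import itertools
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def make_baseline():
        return nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))

    def make_variant():  # stand-in for "the new architecture"
        return nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))

    def train_and_eval(make_model, lr, wd, seed=0):
        torch.manual_seed(seed)                   # same data and init per (lr, wd) cell
        x = torch.randn(512, 16)
        y = x.roll(1, dims=1)                     # arbitrary synthetic target
        model = make_model()
        opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
        for _ in range(200):
            loss = F.mse_loss(model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            return F.mse_loss(model(x), y).item() # stand-in metric

    grid = list(itertools.product([1e-3, 3e-4, 1e-4], [0.0, 0.1]))  # one grid for both
    for name, ctor in [("baseline", make_baseline), ("variant", make_variant)]:
        best = min(train_and_eval(ctor, lr, wd) for lr, wd in grid)
        print(name, "best loss over the identical sweep:", round(best, 4))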


OK, thanks. I'll take it slow then.


Labs were also competing to train BERTs for $20 or less. People still use them a lot, too.

https://www.databricks.com/blog/mosaicbert

I'll add that they should do a number of small training runs with different architectures and data mixes. That demonstrates generalization.
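
Roughly what that evidence could look like: a grid of small runs over architectures x data mixes x seeds, reported as mean and spread rather than a single number. The architectures, "mixes", and metric below are placeholders, not anything from the paper:

    # Placeholder architectures, "mixes", and metric; the point is the grid and the
    # mean/spread reporting, not the toy task.
    import statistics
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def train_and_eval(arch, mix, seed):
        torch.manual_seed(seed)
        width = {"baseline": 32, "variant": 64}[arch]
        model = nn.Sequential(nn.Linear(8, width), nn.ReLU(), nn.Linear(width, 8))
        scale = {"web": 1.0, "code": 0.3, "math": 3.0}[mix]  # fake "data mix" knob
        x = scale * torch.randn(256, 8)
        y = x.flip([1])                           # arbitrary synthetic target
        opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
        for _ in range(300):
            loss = F.mse_loss(model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            return F.mse_loss(model(x), y).item()

    for arch in ["baseline", "variant"]:
        for mix in ["web", "code", "math"]:
            scores = [train_and_eval(arch, mix, seed) for seed in range(3)]
            print(f"{arch:8s} {mix:5s} mean={statistics.mean(scores):.4f} "
                  f"stdev={statistics.stdev(scores):.4f}")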



