I misunderstood what you said. Well, in your work, what benefit did you get from...

thesz · 2026-01-20T01:03:58 1768871038

At the time (2018), its perplexity was close to an LSTM's, while it had more coefficients and a much shorter training time (hours vs. days).

I tried to apply SNMLM's ideas to byte-level prediction modeling here: https://github.com/thesz/snmlm-per-byte

It was not bad, but I had trouble scaling it to the 1B set, mostly because I did not have enough time.

I do share your mindset that many old techniques are misunderstood or underapplied. For example, in my experiments, decision trees achieve a bit length per byte comparable to LSTM (lstm-compress, or the LSTM in the nncp experiments): https://github.com/thesz/codeta