I tried to apply SNMLM's ideas to byte-level prediction modeling here: https://github.com/thesz/snmlm-per-byte
It was not bad, but I had trouble scaling it to the 1B set, mostly because I did not have enough time.
I hold the same mindset as you: many old techniques are misunderstood or underapplied. For example, in my experiments decision trees achieve bits-per-byte comparable to LSTM (lstm-compress, or the LSTM in the nncp experiments): https://github.com/thesz/codeta
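For anyone unfamiliar with the metric used in that comparison: bits-per-byte is just the average negative log2 probability the model assigns to each actual byte, i.e. the code length an ideal arithmetic coder would achieve under the model. A minimal sketch (the `probs` input is hypothetical, not from either repo):

```python
import math

def bits_per_byte(probs):
    # probs: the probability the model assigned to each byte that
    # actually occurred (hypothetical example input).
    # Average -log2(p) is the ideal arithmetic-coding cost per byte.
    return -sum(math.log2(p) for p in probs) / len(probs)

# A model assigning probability 0.5 to every byte costs exactly 1 bit/byte:
print(bits_per_byte([0.5, 0.5, 0.5, 0.5]))  # → 1.0
```

Lower is better; a uniform model over 256 byte values sits at 8 bits/byte, so both the tree and LSTM models are measured by how far below that they get.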