There’s many examples of noisily encoding a large embedding vocabulary. This sounds a bit like T-free or H-net? Or BLT?
One of the main issues with lines of work around this are that you end up trading embedding parameters for active parameters. This is rarely a good trade-off for the sake of compute.
One of the main issues with lines of work around this are that you end up trading embedding parameters for active parameters. This is rarely a good trade-off for the sake of compute.