Could you elaborate?

littlestymaar · 2026-03-04T06:33:08 1772605988

From what we know, most AI labs have used a majority of artificial data since 2023.

I had a discussion about a year ago with a researcher at Kyutai and they told me their lab was spending an order of magnitude more compute on artificial data generation than on training proper. I can't tell if that ratio applies to the industry as a whole, but artificial datasets are the cornerstone of modern AI training.

the_af · 2026-03-04T12:01:33 1772625693

How does it work? How do they prevent model collapse? What purpose does a majority of artificial data serve?

How do they measure success?

Edit: I asked ChatGPT and it thinks "success" means frontier models being distilled into smaller models with equal reasoning power, or more focused models for specific tasks, and also it claims the web has been basically scraped already and by necessity new sources are needed, of which synthetic data is one. It seems like the basis of scifi dystopia to me, a hungry LLM looking for new sources of data... "feed me more data! I must be fed! Roar"

Edit 2: for some things I see a clear path: ChatGPT mentions autogenerating coding or math problems for which the solution can be automatically verified, so that you can hone the logical skills of the model at large scale.
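
To make the "automatically verified" idea concrete, here is a minimal sketch of how I imagine such a loop could look. Everything in it is my own assumption, not something a lab has described: `make_problem`, `verify` and `build_dataset` are made-up names, and `solver` is a hypothetical stand-in for whatever function calls the model. The point is only that the generator knows the right answer, so wrong outputs can be dropped without a human in the loop.

```python
import random

def make_problem(rng):
    """Return (question, answer) for a random integer arithmetic problem."""
    a, b = rng.randint(1, 99), rng.randint(1, 99)
    op = rng.choice(["+", "-", "*"])
    answer = {"+": a + b, "-": a - b, "*": a * b}[op]
    return f"What is {a} {op} {b}?", answer

def verify(candidate, answer):
    """Check a model's raw text output against the known answer."""
    try:
        return int(candidate.strip()) == answer
    except ValueError:
        # Output was not a bare integer, so it counts as a failure.
        return False

def build_dataset(n, solver, seed=0):
    """Generate n problems and keep only the pairs whose output verifies.

    `solver` is a hypothetical callable (question -> text), standing in
    for a model call.
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(n):
        question, answer = make_problem(rng)
        output = solver(question)
        if verify(output, answer):
            kept.append({"prompt": question, "completion": output})
    return kept
```

Real pipelines presumably use much harder problems (unit tests for code, proof checkers for math), but the filter-by-verifier shape would be the same.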

littlestymaar · 2026-03-04T22:14:44 1772662484

I'm no specialist of the field at all, but in the context of Kyutai they explained their workflow a bit to make their speech to speech model. And basically it boils down to: if you want to make an STT (speech to text) model, you can generate audio tracks using a TTS (text to speech) model, and then you have a supervised audio/text pair. You can even add as much noise to the audio as you want, to make a noise resistant STT model.
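
For the noise part, a rough sketch of what I have in mind (my own illustration, not Kyutai's code): take one clean synthetic clip with its transcript and produce several noisy copies that all keep the same label. The function names are made up, the clip is just a list of float samples standing in for synthesized audio, and white Gaussian noise at a target signal-to-noise ratio is an assumption on my part; real setups likely mix in recorded background noise, reverb and so on.

```python
import math
import random

def add_noise(samples, snr_db, rng):
    """Mix white Gaussian noise into `samples` at roughly the given SNR (dB)."""
    signal_power = sum(s * s for s in samples) / len(samples)
    # SNR_dB = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]

def augment_pair(text, samples, snr_levels_db, seed=0):
    """Return one (noisy_audio, text) pair per requested SNR level.

    The transcript is known exactly because the audio was generated
    from it, so every noisy copy is still a correctly labeled example.
    """
    rng = random.Random(seed)
    return [(add_noise(samples, snr, rng), text) for snr in snr_levels_db]
```

Calling `augment_pair("hello world", clip, [20, 10, 0])` would give three training pairs of decreasing audio quality for a single transcript.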

suddenlybananas · 2026-03-04T07:30:01 1772609401

I find this very surprising, do you have any papers on the kinds of techniques that they use?

nhecker · 2026-03-03T20:35:47 1772570147

EDIT: probably not relevant, after re-re-reading the comment in question.

Presumably littlestymaar is talking about all the LLM-generated output that's publicly available on the Internet (in various qualities but significant quantity) and there for the scraping.