If you were building a model and had the choice, would you prefer a smaller dataset with higher accuracy (~99%) or a much larger dataset with lower accuracy (~90%)?
Generally, more accurate data is better, especially if you have a small compute budget.

There is a limit to this, though. For example, the Wikipedia dataset is more accurate than general scraped internet text, but LLMs are still trained on scraped internet data because there are orders of magnitude more of it.
If you can collect a lot of data, you can use self-supervised learning to learn from the massive pool of unlabeled data, then fine-tune that model on the small set of accurately labeled examples. That way you benefit from both and maximize performance.
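A minimal sketch of that two-stage pipeline, with PCA standing in for the self-supervised pretraining objective and logistic regression as the fine-tuned head. All data here is synthetic, and the split sizes and dimensions are arbitrary assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Lots of unlabeled data: two noisy clusters in 20 dimensions.
centers = rng.normal(size=(2, 20))
unlabeled = np.vstack([
    centers[i] + 0.5 * rng.normal(size=(5000, 20)) for i in range(2)
])

# Only a handful of accurately labeled examples.
labeled_x = np.vstack([
    centers[i] + 0.5 * rng.normal(size=(20, 20)) for i in range(2)
])
labeled_y = np.repeat([0, 1], 20)

# Stage 1: "pretrain" a representation on the big unlabeled pool
# (PCA here stands in for a real self-supervised objective).
encoder = PCA(n_components=5).fit(unlabeled)

# Stage 2: fine-tune a small classifier on the few labeled points,
# working in the pretrained representation.
clf = LogisticRegression().fit(encoder.transform(labeled_x), labeled_y)

# Evaluate on held-out synthetic points.
test_x = np.vstack([
    centers[i] + 0.5 * rng.normal(size=(100, 20)) for i in range(2)
])
test_y = np.repeat([0, 1], 100)
accuracy = clf.score(encoder.transform(test_x), test_y)
```

The point of the design is that the encoder never sees a label: it only has to be fit once on cheap unlabeled data, and the expensive labeled data is spent on the small final classifier.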
It depends on how much more data and how much more noise.

Take some model with sqrt(n) convergence, for example. Maybe your error goes like noise/sqrt(n), which means for the same error, the noise vs data tradeoff goes as:

const * noise' / sqrt(n') = error = const * noise / sqrt(n)

=> sqrt(n') / sqrt(n) = noise' / noise

=> n' / n = (noise' / noise)^2

So 4x the data for 2x the noise to come out even. If you offer me 10x the data for 2x the noise, I've come out ahead. If you offer me 2x the data for 2x the noise, I've come out behind.

That all assumes a particular convergence rate, of course. YMMV (your model may vary)
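Plugging numbers into that error model makes the tradeoff concrete (the constant and the baseline noise level and sample size are arbitrary choices for illustration):

```python
import math

def error(noise, n, c=1.0):
    # The convergence model assumed above: error = const * noise / sqrt(n)
    return c * noise / math.sqrt(n)

base = error(noise=1.0, n=100)
even = error(noise=2.0, n=400)     # 4x data, 2x noise: same error
ahead = error(noise=2.0, n=1000)   # 10x data, 2x noise: lower error
behind = error(noise=2.0, n=200)   # 2x data, 2x noise: higher error
```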
Under the usual sqrt(n) convergence, you need 100 times more data to make up for a 10-fold loss in precision. And that's not even counting sample biases.
OhhAskMe is a math solving hub where high school and university students ask and answer loads of math questions, discuss the latest in math, and share their knowledge. It’s 100% free!