This thread is dedicated to self-supervised contrastive learning techniques that work with standard batch sizes. I am looking for current methods in this field, specifically ones that do not rely on large batch sizes.

I am familiar with the SimSiam paper from Meta (FAIR) research, which uses a batch size of 256 across 8 GPUs. However, for people with limited resources like myself, access to that many GPUs may not be feasible. I am therefore interested in methods that work with smaller batch sizes on a single GPU, ideally ones suitable for training on 1024x1024 input images.

I am also curious about any more efficient architectures developed in this area, including, but not limited to, techniques from natural language processing that may transfer to other domains.

***Posted the same question on the PyTorch forums; reposting here for wider reach.

---

I managed to use SwAV on 1 GPU (8GB), batch size 240, 224x224 images, FP16, ResNet18.

Of course it works; the problem isn't the batch size per se but the accuracy vs. batch-size trade-off, and the accuracy was quite bad (still usable for my task, though). If around 50% top-5 on ImageNet is OK for you, you can do it, but I'm not sure there are many tasks where it makes sense.

Perhaps contrastive learning isn't the best fit for a single GPU. I'm not sure what the current SOTA is for this setting.

---

How about MoCo v2? That should work on a single GPU, since the negatives come from a momentum-encoded queue rather than from the current batch, which decouples the number of negatives from the batch size.
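A minimal sketch of that mechanism, i.e. a momentum-updated key encoder plus a FIFO queue of negatives; the queue size, momentum, and temperature below are illustrative defaults, not the reference implementation:

```python
# Sketch of the MoCo-style queue mechanism: negatives come from a memory
# queue, so their number is independent of the batch size.
import torch
import torch.nn.functional as F

K, D = 65536, 128                      # queue length and embedding dimension
queue = F.normalize(torch.randn(K, D), dim=1)
momentum, temperature = 0.999, 0.2

@torch.no_grad()
def update_key_encoder(encoder_q, encoder_k):
    # EMA update: the key encoder slowly tracks the query encoder.
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(momentum).add_(p_q.data, alpha=1 - momentum)

def moco_loss(q, k, queue):
    # q, k: (B, D) embeddings of two views of the same images
    # (q from the query encoder, k from the key encoder).
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1).detach()
    l_pos = (q * k).sum(dim=1, keepdim=True)        # (B, 1) positive logits
    l_neg = q @ queue.t()                           # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    loss = F.cross_entropy(logits, labels)          # positive is class 0
    new_queue = torch.cat([k, queue], dim=0)[:K]    # enqueue new, drop oldest
    return loss, new_queue
```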

---

If you are willing to trade training time for a larger effective batch size, you can try gradient accumulation.
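A minimal sketch of what that looks like; `model`, `loader`, `criterion`, and `optimizer` are placeholders for whatever you already use. One caveat: with a loss that uses in-batch negatives, accumulation only averages gradients over micro-batches; it does not add negatives across them the way a genuinely larger batch would, so it helps most with non-contrastive objectives.

```python
# Gradient accumulation: accumulate .grad over several micro-batches,
# then take one optimizer step.
accum_steps = 8   # effective batch = accum_steps * micro_batch_size

optimizer.zero_grad()
for step, (view1, view2) in enumerate(loader):
    z1, z2 = model(view1), model(view2)
    loss = criterion(z1, z2) / accum_steps   # scale so gradients average
    loss.backward()                          # grads accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```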

---

Barlow Twins, maybe? It's easy to implement and works well at small batch sizes.
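For reference, the objective is just the cross-correlation matrix between the two views' standardized embeddings, pushed toward the identity, which is part of why it is so easy to implement. A rough sketch (the `lambd` weight and the epsilon are illustrative):

```python
import torch

def barlow_twins_loss(z1, z2, lambd=5e-3):
    # z1, z2: (B, D) projector outputs for two augmented views of the batch.
    B, D = z1.shape
    # Standardize each embedding dimension across the batch.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / B                                 # (D, D) cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()      # push diagonal toward 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelate dims
    return on_diag + lambd * off_diag
```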

---

Cache your predictions (embeddings) from each smaller batch, together with their labels/pairings, until you reach the target batch size, then run your loss function.

So, instead of computing the loss on every micro-batch and accumulating gradients as in gradient accumulation, you only compute the loss once you reach the target batch size.
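A rough sketch of this idea, with `model`, `loader`, `criterion`, and `optimizer` as placeholders. One caveat: if you cache activations with their computation graphs attached, memory grows just like a genuinely larger batch, so the usual compromise is to detach the cached embeddings (memory-bank style) and only backpropagate through the most recent micro-batch:

```python
import torch

target_batch = 1024          # the batch size you want the loss to "see"
cache_a, cache_b = [], []    # cached (detached) embeddings of the two views

for view1, view2 in loader:
    z1, z2 = model(view1), model(view2)
    cache_a.append(z1.detach())
    cache_b.append(z2.detach())
    if sum(t.size(0) for t in cache_a) < target_batch:
        continue
    # Swap the newest entries back in with their graphs attached, so the
    # loss sees the full cached batch but gradients flow only through the
    # most recent micro-batch.
    cache_a[-1], cache_b[-1] = z1, z2
    loss = criterion(torch.cat(cache_a), torch.cat(cache_b))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    cache_a, cache_b = [], []
```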

---

VICReg
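For context, VICReg's loss is a sum of three terms (invariance, variance, covariance), none of which uses in-batch negatives, which is part of why it tolerates smaller batches. A rough sketch; the weights roughly follow the paper's defaults, but treat the exact values and the epsilon as assumptions to tune:

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0):
    # z1, z2: (B, D) projector outputs for two views of the same batch.
    B, D = z1.shape
    inv = F.mse_loss(z1, z2)                         # invariance term

    def variance(z):                                 # keep per-dim std near 1
        std = torch.sqrt(z.var(dim=0) + 1e-4)
        return torch.relu(1.0 - std).mean()

    def covariance(z):                               # decorrelate dimensions
        z = z - z.mean(dim=0)
        c = (z.T @ z) / (B - 1)
        off = c - torch.diag(torch.diagonal(c))
        return off.pow(2).sum() / D

    var = variance(z1) + variance(z2)
    cov = covariance(z1) + covariance(z2)
    return sim_w * inv + var_w * var + cov_w * cov
```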

---

Is there any reason you'd like a contrastive algorithm? (intra-class discrimination?)

Barlow Twins has been shown to work quite well with smaller batches (32), and HSIC-SSL is a nice variant on this style of learning if you only care about clusters. I'm sure SimSiam is fine too (avoid BYOL for small batches).

In terms of contrastive approaches, methods that avoid the "coupling" of the negative terms described in DCL will work with smaller batch sizes (contrastive estimates converge to the MLE assuming a large number of noise samples). This is seen in the spectral algorithm or in align-uniform. These work because they do not compare the representations coming from the same augmented samples. SwAV also does this via contrastive prototypes, which are basically free variables whose gradients don't conflict with any alignment goal. I think it's fair to say that algorithms with LSE (log-sum-exp) transforms are less stable at small batch sizes, since the gradients will be biased toward randomly coupled terms; with sufficiently many terms this coupling matters less.

From what I've noticed, methods that avoid comparing the augmented views of the same base sample require slightly more tuning to get things just right (align + weight * diversity).
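To make the "align + weight * diversity" structure concrete, here is a rough sketch in the style of the alignment/uniformity loss from align-uniform (Wang & Isola); the weight `lam` and the temperature `t` are illustrative, and the embeddings are assumed to be L2-normalized:

```python
import torch

def align_uniform_loss(z1, z2, lam=1.0, t=2.0):
    # z1, z2: (B, D) L2-normalized embeddings of two views of the same batch.
    align = (z1 - z2).pow(2).sum(dim=1).mean()       # pull positive pairs together

    def uniformity(z):
        # Average pairwise Gaussian potential over distinct samples;
        # minimizing it spreads embeddings over the hypersphere.
        sq_dists = torch.cdist(z, z).pow(2)
        mask = ~torch.eye(z.size(0), dtype=torch.bool, device=z.device)
        return torch.exp(-t * sq_dists[mask]).mean().log()

    return align + lam * (uniformity(z1) + uniformity(z2)) / 2
```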


Notes: NNCLR is nicer than MoCo, IMO. VICReg is good but a mess to fine-tune. I am assuming you're using a CNN, so I have omitted transformer-based and masking-based algorithms.