PAPERS
Revisiting Self-Supervised Visual Representation Learning
https://arxiv.org/pdf/1901.09005.pdf
Problem:
The paper deals with aspects of self-supervised learning that have not been researched thoroughly or recently enough.
Self Supervision:
- Framework for automatically creating a supervision signal in order to learn representations that are useful for downstream tasks
- Requires only unlabeled data to formulate a pretext learning task, such as predicting context
- Pretext tasks must be designed in such a way that high level understanding is useful for solving them
Architecture of CNN models:
- The paper evaluates self-supervised techniques in the image domain and hence focuses on the use of CNNs.
- Uses variants of ResNet and a batch normalized VGG architecture.
Self-supervised techniques:
- Rotation: Produces 4 copies of a single image by rotating it by {0°, 90°, 180°, 270°} and tasks the model with classifying the rotation (see the sketch after this list).
- Exemplar: Each individual image corresponds to its own class, and other members of that class are generated by heavy random data augmentation such as translation, scaling, rotation, and contrast and color shifts. Uses a triplet loss which encourages augmented versions of the same image to have representations that are close in Euclidean space.
- Jigsaw: The model has to recover the relative spatial positions of 9 randomly sampled image patches after a random permutation of these patches was performed.
- Relative patch location: Similar to the Jigsaw task. The model receives two patches of an image and needs to predict which of the 8 possible spatial relations holds between the two patches.
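A minimal PyTorch-style sketch of the rotation pretext task described above; the tiny encoder and batch construction are illustrative stand-ins, not the architectures or code used in the paper:

```python
import torch
import torch.nn as nn

def make_rotation_batch(images):
    """images: (B, C, H, W) -> 4 rotated copies (4B, C, H, W) plus rotation labels."""
    rotated = [torch.rot90(images, k, dims=(2, 3)) for k in range(4)]
    labels = torch.arange(4).repeat_interleave(images.size(0))  # 0 / 90 / 180 / 270 degrees
    return torch.cat(rotated, dim=0), labels

# Hypothetical toy encoder; the paper uses ResNet and VGG variants instead.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
rotation_head = nn.Linear(16, 4)  # 4-way rotation classification
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 32, 32)  # stand-in for an unlabeled image batch
inputs, targets = make_rotation_batch(images)
loss = criterion(rotation_head(encoder(inputs)), targets)
loss.backward()
```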
Evaluation of Learned Visual Representations:
- The learned representations are evaluated by using them in downstream tasks. The tasks used in this paper are multiclass image classification tasks.
- Datasets used are ImageNet and Places205
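A minimal sketch of this downstream evaluation, assuming a frozen pretrained encoder and labeled data loaders (the names `encoder`, `train_loader`, and `test_loader` are hypothetical); the scikit-learn logistic regression here stands in for the paper's linear evaluation setup:

```python
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_features(encoder, loader):
    """Run the frozen encoder over a labeled dataset and collect features."""
    encoder.eval()
    feats, labels = [], []
    for images, targets in loader:
        feats.append(encoder(images))
        labels.append(targets)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

# Frozen representations for the downstream classification task (e.g. ImageNet or Places205).
X_train, y_train = extract_features(encoder, train_loader)
X_test, y_test = extract_features(encoder, test_loader)

# Only the linear classifier is trained; the encoder weights stay fixed.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("downstream accuracy:", clf.score(X_test, y_test))
```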
Experimental Results:
- Similar models often learn self-supervised visual representations that differ significantly in downstream performance
- The ranking of architectures is not consistent across the self-supervised tasks, and the ranking of methods is not consistent across architectures
- One clear observation is that increasing the number of channels in CNN models improves the performance of the self-supervised models, similar to what is observed for supervised models
Observations:
- Better performance on the pretext task does not always translate to better representations
- Good pretext-task performance is useful for evaluating the potential of a model, but only after the model architecture has been fixed
- It cannot be used to reliably select the model architecture
- Skip connections prevent the degradation of representation quality towards the end of CNNs
- The authors believe this degradation happens because the model overfits to the pretext task in the later layers and discards the more general semantic features present in the middle layers
- The degradation holds only for VGG and not for ResNet; the authors believe this is a result of ResNet's residual units being invertible under some conditions
- Increasing the model width and the representation size improves representation quality