INDEX
Explanations
references to appearance or visual assessment
Negative Logits
Token       Logit
utra        -0.20
874         -0.16
somew       -0.15
нки         -0.14
somewhere   -0.14
iar         -0.14
chor        -0.14
ugar        -0.14
Carp        -0.13
ovation     -0.13
Positive Logits
Token       Logit
like        0.51
like        0.44
Like        0.43
Like        0.42
LIKE        0.38
LIKE        0.37
-like       0.36
likes       0.35
_like       0.33
.like       0.31
Activations Density 0.013%