Explanations
the word "against" and phrases indicating direction and impact
Negative Logits
Token            Value
ValueStyle       -0.63
orteur           -0.63
habet            -0.59
ambito           -0.57
locomotion       -0.57
UTERS            -0.56
Distribuzione    -0.56
لينكات           -0.55
writeField       -0.55
ladle            -0.55
Positive Logits
Token            Value
Against          0.69
wall             0.66
+#+#             0.64
Against          0.62
Wall             0.58
AGAINST          0.58
against          0.57
tingly           0.57
against          0.56
WALL             0.56
Activations Density: 0.019%