INDEX
Explanations
instances of social norms and practices
Negative Logits
lí          -0.17
owitz       -0.16
adb         -0.16
Quận        -0.14
Arbitrary   -0.14
оже         -0.14
Zero        -0.14
kJ          -0.14
ерк         -0.13
ication     -0.13
Positive Logits
even        0.24
almost      0.20
even        0.19
sometimes   0.17
даже        0.17
sogar       0.17
almost      0.17
pais        0.16
actually    0.15
EVEN        0.15
Activations Density 0.054%