INDEX
Explanations
instances of the word "left" and related directional terms
Negative Logits
Token        Logit
risk         -0.17
llib         -0.17
uw           -0.15
-hide        -0.14
appa         -0.14
rh           -0.14
ibar         -0.14
qualified    -0.14
hop          -0.14
ewire        -0.13
Positive Logits
Token        Logit
most         0.19
-hand        0.18
ness         0.17
tings        0.16
/right       0.16
bies         0.16
-leaning     0.16
sy           0.15
-wing        0.15
ISTS         0.15
Activations Density: 0.041%