INDEX
Explanations
the pronouns "he" and "she", at high activation levels
references to gendered pronouns, specifically "he" and "she"
Negative Logits
Token      Logit
Joy        -0.85
Vive       -0.71
Dam        -0.70
rar        -0.69
Sov        -0.66
Vil        -0.65
1080       -0.65
Cruiser    -0.65
Js         -0.64
Squid      -0.63
Positive Logits
Token      Logit
itage      0.75
self       0.73
own        0.71
ゴン       0.71
initials   0.71
itant      0.67
gdala      0.66
agher      0.64
owe        0.64
fate       0.63
Activations Density: 0.058%