Explanations
references to male individuals, particularly using the word "guy."
Negative Logits
Token     Logit
Eber      -0.73
———-      -0.69
AER       -0.67
“         -0.66
=”        -0.66
Kear      -0.64
くると    -0.63
Beg       -0.63
d         -0.63
kso       -0.62
Positive Logits
Token     Logit
guys      1.75
Guys      1.75
guys      1.74
GUYS      1.70
Guys      1.70
GUY       1.61
guy       1.55
Guy       1.52
guy       1.43
Guy       1.43
Activations Density: 0.058%
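
To make the density figure concrete, here is a minimal sketch of how such a number could be computed, assuming that density means the fraction of token positions on which the feature's activation is above zero. That definition, the function name activation_density, and the sample data are all assumptions for illustration; the dashboard does not state how it derives the value.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of tokens on which the feature fires.
    """
    if not activations:
        raise ValueError("activations must not be empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Synthetic data, not taken from the dashboard: the feature fires on
# 58 out of 100,000 token positions.
acts = [0.0] * 99_942 + [1.3] * 58
density = activation_density(acts)
print(f"{density:.3%}")
```

Under that assumed definition, a density of 0.058% would correspond to the feature firing on roughly 58 of every 100,000 tokens, which fits a feature tied to a narrow set of words.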