INDEX
Explanations
mentions of specific people
the presence of specific pronouns and references to individuals, particularly focusing on female pronouns
Negative Logits
igham         -0.67
reach         -0.65
¿½            -0.65
iHUD          -0.60
izo           -0.58
ender         -0.58
Allied        -0.58
isky          -0.55
hod           -0.55
reperc        -0.55
Positive Logits
ï¸ı           0.80
xual          0.71
Ö¼            0.69
respectively  0.68
в             0.66
anwhile       0.65
к             0.63
sic           0.62
arine         0.62
erent         0.61
Activations Density: 0.623%