Explanations
references to male or female characters or individuals
Negative Logits
Token       Logit
Notion      -0.81
aarrggbb    -0.78
kasarigan   -0.76
soot        -0.71
things      -0.71
شهاد        -0.71
Sitten      -0.69
YAP         -0.68
THINGS      -0.67
Machu       -0.67
Positive Logits
Token       Logit
Male        1.24
MALE        1.18
MALE        1.11
male        1.11
FEMALE      1.05
Male        1.02
males       1.00
emale       0.98
Females     0.97
female      0.96
Activations Density 0.080%