Explanations
mentions of individuals by their gender-neutral pronouns
references to individuals and their roles or achievements
Negative Logits
Token       Logit
Untitled    -0.73
bothering   -0.71
Battery     -0.63
Enlarge     -0.62
mma         -0.61
★★          -0.60
••          -0.60
pregn       -0.60
Leaks       -0.60
cliché      -0.60
Positive Logits
Token       Logit
also        0.94
pherd       0.91
theless     0.88
miah        0.88
consists    0.86
consisted   0.84
comprises   0.83
graduated   0.81
ffield      0.80
'll         0.79
Activations Density 0.305%
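
The density figure is most naturally read as the fraction of sampled tokens on which the feature has a non-zero activation. The page does not state the exact definition, so the sketch below is an assumption, and the function name, threshold, and sample data are hypothetical, chosen only to illustrate the arithmetic.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of sampled tokens on which the
    feature is active. The source page does not spell out its definition.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 20,000 token positions, 61 of them active.
sample = [0.0] * 19939 + [1.0] * 61
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.305%
```

A density this low means the feature is silent on the vast majority of tokens, which is typical of a sparse feature.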