INDEX
Explanations
pronouns referring to males
Negative Logits
Doing    0.50
Doing    0.46
οδο      0.46
CNT      0.40
doing    0.39
oman     0.39
ทำ       0.37
làm      0.37
ώσεις    0.37
Doom     0.36
Positive Logits
adou     0.43
"__      0.41
Iwas     0.39
handled  0.39
Frey     0.38
बराबर    0.38
指       0.38
Fred     0.37
fing     0.37
ἆ        0.37
Activations Density: 0.000%