INDEX
Explanations
words related to legal matters and medical conditions
references and terminology related to gender and social roles
Negative Logits
Kan       -1.01
Kap       -0.99
Gan       -0.92
Kahn      -0.87
Mull      -0.87
Joan      -0.86
Stefan    -0.86
UX        -0.85
Khan      -0.84
Guth      -0.83
Positive Logits
ears      1.14
ĵ         0.93
paralle   0.93
orse      0.91
orses     0.87
ear       0.87
oren      0.86
inder     0.86
ory       0.85
eryl      0.85
Activations Density: 0.413%