Explanations
references to hate or hate-related concepts
Negative Logits
erable            -0.17
erie              -0.15
郎                -0.15
.scalablytyped    -0.15
ettle             -0.14
thora             -0.14
заст              -0.14
lover             -0.14
Martial           -0.14
ieur              -0.14
Positive Logits
speech            0.39
fully             0.32
Speech            0.32
crime             0.32
Speech            0.28
crimes            0.27
speech            0.27
peech             0.26
fulness           0.26
crime             0.25
Activations Density 0.010%