INDEX
Explanations
mentions of specific groups or categories within a broader context
Negative Logits
nery       -0.70
ysc        -0.69
oldemort   -0.64
prol       -0.62
ober       -0.62
irez       -0.61
idation    -0.60
anytime    -0.59
idia       -0.58
adal       -0.58
Positive Logits
st         0.95
IJ         0.88
Īè         0.87
Ĭ±         0.87
ĪĴ         0.84
stad       0.82
ī          0.79
Ĥª         0.79
among      0.76
eteen      0.75
Activations Density 1.317%