INDEX
Explanations
mentions of specific groups or organizations
the end of the document or sections within it
Negative Logits
ADE        -0.75
IDER       -0.67
lect       -0.64
ULT        -0.63
onne       -0.62
STD        -0.61
emetery    -0.61
Animal     -0.59
Vaugh      -0.59
avorite    -0.59
Positive Logits
chery       0.86
ullah       0.82
bh          0.79
chet        0.79
rina        0.75
nikov       0.74
ifa         0.74
rance       0.72
amiya       0.71
ronics      0.71
Activation Density: 0.025%
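The two logit lists and the density figure can be reproduced from raw numbers roughly as follows. This is a sketch under assumptions the card itself does not state: that each token's value is the feature's direct effect on that token's logit (in SAE dashboards this is commonly the feature's decoder direction projected through the model's unembedding), and that density means the percentage of token positions where the feature's activation is positive. The helper names `top_logits` and `activation_density` are hypothetical, and the activation list in the usage lines is a toy input.

```python
from typing import Dict, List, Sequence, Tuple


def top_logits(
    effects: Dict[str, float], k: int = 10
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most negative and the k most positive (token, effect) pairs.

    Assumption: `effects` maps each vocabulary token to the feature's
    direct effect on that token's logit.
    """
    ranked = sorted(effects.items(), key=lambda item: item[1])
    negative = ranked[:k]        # most negative first
    positive = ranked[::-1][:k]  # most positive first
    return negative, positive


def activation_density(activations: Sequence[float]) -> float:
    """Percentage of token positions where the feature fires (activation > 0).

    Assumption: "density" is the share of positions with a positive activation.
    """
    if not activations:
        return 0.0
    firing = sum(1 for a in activations if a > 0)
    return 100.0 * firing / len(activations)


# A few values taken from the lists above.
effects = {"ADE": -0.75, "IDER": -0.67, "chery": 0.86, "ullah": 0.82, "bh": 0.79}
neg, pos = top_logits(effects, k=2)
print(neg)  # [('ADE', -0.75), ('IDER', -0.67)]
print(pos)  # [('chery', 0.86), ('ullah', 0.82)]

# Toy input: one firing position out of 4,000 gives 0.025%.
print(activation_density([0.0] * 3999 + [1.3]))  # 0.025
```

At 0.025%, the feature fires on roughly one token in every 4,000.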