Explanations
names of individuals associated with controversial topics or issues
Negative Logits
Token     Logit
iP        -0.86
RELE      -0.84
IMP       -0.80
INTER     -0.77
acron     -0.77
ASC       -0.76
supers    -0.74
DIRECT    -0.71
SAN       -0.71
CONTR     -0.69
Positive Logits
Token     Logit
inar      1.12
idy       1.10
aga       1.09
axis      1.05
ady       1.00
ipal      1.00
ucket     1.00
arie      0.98
ilus      0.97
actor     0.97
Activations Density: 2.075%