INDEX
Explanations
negative statements or contrasts
negative phrases and concepts regarding evidence and social issues
NEGATIVE LOGITS
igm       -0.77
ipeg      -0.74
ovember   -0.66
Nare      -0.65
iatus     -0.64
WN        -0.64
uden      -0.63
affe      -0.62
Shan      -0.62
Notes     -0.61
POSITIVE LOGITS
asso            0.69
cause           0.66
decency         0.64
clot            0.63
slightest       0.63
la              0.61
ilings          0.60
hypocr          0.59
sensit          0.57
righteousness   0.57
Activations Density: 0.261%