Explanations
phrases related to a negative attitude towards certain behaviors or individuals
phrases that express respect or concern toward various subjects
Negative Logits
Token        Logit
breaks       -0.72
quad         -0.71
gallery      -0.71
NAS          -0.69
edin         -0.69
ns           -0.69
wcs          -0.69
Tracker      -0.68
sts          -0.67
daq          -0.66
Positive Logits
Token        Logit
conformity   0.86
oneself      0.77
mortality    0.74
rationality  0.71
autonomy     0.70
criminality  0.68
quo          0.65
populated    0.65
labour       0.65
antiqu       0.64
Activation Density: 0.256%