INDEX
Explanations
statements discussing moral or ethical judgments
phrases indicating moral judgments and ethical considerations
Negative Logits
76561           -0.70
iping           -0.68
acan            -0.66
nants           -0.66
arta            -0.64
phabet          -0.62
Advertisements  -0.62
rongh           -0.61
jong            -0.61
mage            -0.60
Positive Logits
raining         0.88
ceivable        0.73
impossible      0.71
coincidence     0.70
folly           0.69
advisable       0.64
EC              0.63
to              0.62
ifiable         0.61
conceivable     0.60
Activations Density 0.378%
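The dashboard does not define its density figure. It is most likely the share of sampled tokens on which the feature's activation is above zero. The Python sketch below assumes that definition. The `activation_density` helper, its `threshold` parameter, and the sample data are all hypothetical and used only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens on which the feature fires (activation > threshold).

    Assumption: "density" means the share of tokens with a positive activation.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


# Hypothetical sample: the feature fires on 3 out of 1000 tokens.
acts = [0.0] * 997 + [1.2, 0.4, 2.5]
print(f"{activation_density(acts):.3%}")  # prints 0.300%
```

Under this definition, a density of 0.378% means the feature fires on roughly 4 tokens in every 1,000. That makes it a sparse feature.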