INDEX
Explanations
phrases related to morality and ethical considerations
terms related to ethics and moral considerations
Negative Logits
xual      -0.92
-+        -0.72
oday      -0.68
Twice     -0.66
anian     -0.66
gery      -0.65
mble      -0.64
nces      -0.64
ptives    -0.63
lv        -0.63
Positive Logits
compass       1.15
istic         1.08
indignation   1.05
izing         1.04
relat         1.00
dile          0.99
hazard        0.97
equival       0.96
ising         0.94
IZE           0.92
Activations Density 0.056%