Explanations
words related to moral values and ethics
references to moral concepts and ethical discussions
Negative Logits
Token       Logit
xual        -0.83
rams        -0.70
-+          -0.69
Lup         -0.69
nces        -0.68
Twice       -0.68
upon        -0.67
WER         -0.66
gow         -0.65
minster     -0.65
Positive Logits
Token        Logit
istic        1.15
izing        1.12
ising        1.09
hazard       1.03
compass      1.03
ised         1.00
indignation  0.99
izational    0.97
equival      0.96
IZE          0.95
Activations Density: 0.034%
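
As a rough illustration of what a density figure like this measures, the sketch below computes the fraction of tokens on which a feature fires. It assumes density means "share of tokens with activation above a threshold"; the function name activation_density, the threshold of 0.0, and the toy data are all hypothetical and not taken from the dashboard itself.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: density is defined as (active tokens) / (all tokens).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 10,000 tokens, of which 3 activate the feature.
sample = [0.0] * 10_000
for position, value in [(5, 1.2), (4_200, 0.4), (9_999, 2.7)]:
    sample[position] = value

density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.034% would correspond to the feature firing on roughly 34 out of every 100,000 tokens.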