INDEX
Explanations
vocabulary related to morality and morals
Negative Logits
xual        -1.18
rams        -1.11
essions     -1.10
lers        -1.09
Pavilion    -1.06
hips        -1.04
kt          -1.00
WER         -1.00
hw          -1.00
gow         -0.99
POSITIVE LOGITS
hazard       1.41
compass      1.38
istic        1.36
istically    1.29
equival      1.24
ising        1.23
conscience   1.21
indignation  1.21
dile         1.18
ised         1.17
Activations Density: 1.139%