Explanations
phrases related to doing the right thing
Negative Logits
Token         Logit
ahon          -0.74
NetMessage    -0.71
CLUD          -0.70
TAIN          -0.69
ockey         -0.69
haps          -0.65
acked         -0.65
inav          -0.65
urat          -0.64
ERSON         -0.63
Positive Logits
Token         Logit
financially   0.86
defensively   0.82
fulness       0.82
offensively   0.79
morally       0.72
.#            0.71
outweigh      0.71
¯             0.71
sacrificing   0.71
anonymously   0.71
Activations Density 0.028%
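
The density figure is most naturally read as the fraction of sampled token positions on which the feature fires. The sketch below illustrates that reading only. The function name `activation_density`, the zero threshold, and the sample data are all assumptions made for illustration, not the dashboard's actual computation.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature is active, with "active" defined as strictly above `threshold`.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: 10,000 token positions, 3 of which activate the feature.
sample = [0.0] * 9997 + [1.2, 0.4, 2.5]
density = activation_density(sample)
print(f"{density:.3%}")  # 0.030%
```

Under this reading, a density of 0.028% would correspond to the feature firing on roughly 28 of every 100,000 sampled tokens.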