Explanations
words related to moral concepts
concepts related to morality and virtue
Negative Logits
eding       -0.85
location    -0.81
WER         -0.78
gow         -0.75
funding     -0.75
mining      -0.72
CAST        -0.72
raltar      -0.71
sites       -0.69
overed      -0.69
Positive Logits
righteousness   0.85
precept         0.84
ocracy          0.81
incarn          0.81
ocratic         0.80
deeds           0.78
indignation     0.76
compass         0.71
morally         0.68
virtuous        0.68
Activations Density 0.053%
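
The density figure above can be read as the share of token positions on which the feature fires. The sketch below shows one way such a number might be computed. It is an assumption, not the dashboard's documented method: the function name `activation_density`, the zero threshold, and the example data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means (active positions) / (all positions).
    The dashboard may define or sample it differently.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 5 of which are active.
example = [0.0] * 9995 + [1.2, 0.4, 3.1, 0.9, 2.2]
density = activation_density(example)

# Format as a percentage with three decimals, matching the dashboard's style.
print(f"{density:.3%}")
```

Under this reading, a density of 0.053% would correspond to roughly 5 active positions per 10,000 tokens, which fits a sparse feature that fires only on a narrow set of moral-vocabulary contexts.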