INDEX
Explanations
words related to moral or personal virtues
terms related to morality and virtue
Negative Logits
ockets        -0.82
oval          -0.71
bered         -0.67
grown         -0.67
ZI            -0.65
alone         -0.61
gren          -0.61
fragmented    -0.60
cedented      -0.60
KS            -0.60
Positive Logits
virtue        1.02
signalling    0.93
Virtue        0.88
dilig         0.83
signaling     0.80
iosity        0.78
Deity         0.76
ienne         0.76
distingu      0.75
resil         0.73
Activations Density: 0.007%