INDEX
Explanations
words related to positive moral qualities or ethical concepts
concepts related to moral or ethical goodness
Negative Logits
eters       -0.77
ĸļ          -0.76
oths        -0.73
ptin        -0.72
ategory     -0.71
otta        -0.70
opers       -0.70
pper        -0.68
Sturgeon    -0.67
kson        -0.67
Positive Logits
intentions  1.14
Samar       1.11
deeds       1.07
reads       1.07
deed        1.06
enough      1.01
luck        1.00
luck        0.97
bye         0.94
ol          0.93
Activations Density: 0.074%