INDEX
Explanations
positive attributes or actions related to morality or ethics
phrases or concepts associated with "good faith" or general goodness
Negative Logits
ĸļ          -0.80
eters       -0.72
olate       -0.71
Sturgeon    -0.70
otom        -0.70
_>          -0.68
EStream     -0.67
agos        -0.67
Canaver     -0.66
hyde        -0.66
Positive Logits
intentions  1.27
Samar       1.23
luck        1.19
bye         1.18
reads       1.14
deeds       1.13
deed        1.12
ol          1.10
fortune     1.05
enough      1.04
Activations Density: 0.068%