INDEX
Explanations
phrases indicating simplicity and goodness in various contexts
NEGATIVE LOGITS
odel         -0.18
wiki         -0.13
Zwe          -0.13
cop          -0.13
rex          -0.13
ãĥľ          -0.13
wikipedia    -0.13
ãĥĥãĤ·ãĥ¥    -0.13
Lesser       -0.13
ÙĪÙĦا       -0.13
POSITIVE LOGITS
honest          0.18
understanding   0.17
romise          0.17
Honest          0.17
atro            0.17
è¯ļ             0.16
God             0.15
understand      0.15
Gentle          0.15
honesty         0.15
Activations Density 0.114%
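
Several tokens in the logit lists (ãĥľ, ãĥĥãĤ·ãĥ¥, ÙĪÙĦا, è¯ļ) look like raw byte-level BPE strings, not rendering errors. The sketch below assumes the underlying tokenizer uses the GPT-2 style byte-to-character table, which the page itself does not state. Under that assumption it maps each displayed character back to its byte and decodes the result as UTF-8. The name decode_token is a hypothetical helper, not part of any dashboard API.

```python
def bytes_to_unicode():
    """Build the GPT-2 style table mapping each byte (0-255) to a printable character."""
    # Bytes that are already printable keep their own code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to code point 256 + n, in byte order.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


# Inverse table: displayed character -> original byte value.
CHAR_TO_BYTE = {c: b for b, c in bytes_to_unicode().items()}


def decode_token(token):
    """Hypothetical helper: turn a byte-level token string back into readable text.

    Raises KeyError if the token contains a character outside the table
    (for example a literal space, which the table renders as another character).
    """
    raw = bytes(CHAR_TO_BYTE[ch] for ch in token)
    # Tokens may be fragments of a multi-byte character, so replace invalid bytes.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["ãĥľ", "ãĥĥãĤ·ãĥ¥", "ÙĪÙĦا", "è¯ļ", "honest"]:
        print(repr(tok), "->", repr(decode_token(tok)))
```

If the assumption holds, è¯ļ decodes to 诚, a Chinese character associated with honesty and sincerity, which fits the other positive-logit tokens. Plain ASCII tokens such as honest pass through unchanged.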