Explanations
phrases indicating collective benefit and altruism
Negative Logits
Token          Logit
ordes          -0.15
deniz          -0.15
insky          -0.15
CTS            -0.14
/mainwindow    -0.14
åł¡            -0.14
_busy          -0.14
浪             -0.13
é¡į            -0.13
639            -0.13
Positive Logits
Token          Logit
benefit        0.37
good           0.36
common         0.35
greater        0.32
good           0.30
Good           0.30
Benefit        0.30
common         0.28
-good          0.28
/common        0.28
Activations Density: 0.098%