INDEX
Explanations
expressions of subjective judgments about morality and behavior
Negative Logits
\CMS      -0.16
Äĥn       -0.15
OLID      -0.15
nist      -0.15
ãģ¡ãĤĥ    -0.15
omas      -0.14
Screw     -0.14
μοÏĤ      -0.14
emes      -0.14
è¬        -0.14
Positive Logits
posts     0.20
Posts     0.18
Thread    0.18
ä½łçļĦ    0.18
straw     0.17
/thread   0.17
OP        0.17
troll     0.16
your      0.16
posted    0.16
Activations Density 0.464%
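
Several tokens in the logit lists above (for example ä½łçļĦ or ãģ¡ãĤĥ) look like raw byte-level BPE vocabulary strings rather than readable text. The sketch below shows one way to map such strings back to UTF-8, assuming the vocabulary uses the GPT-2 style byte-to-character table; that assumption is not stated on this page, and the helper names `bytes_to_unicode` and `decode_token` are illustrative only.

```python
def bytes_to_unicode():
    """Byte -> printable-character table used by GPT-2 style byte-level BPE.

    Assumption: the tokens on this page come from a vocabulary of this kind.
    """
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to code point 256 and upward.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


# Inverse table: displayed character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Map a byte-level token string back to text.

    Characters outside the table are kept as their own UTF-8 bytes, and
    incomplete byte sequences decode to the U+FFFD replacement character.
    """
    raw = bytearray()
    for ch in token:
        if ch in UNICODE_TO_BYTE:
            raw.append(UNICODE_TO_BYTE[ch])
        else:
            raw.extend(ch.encode("utf-8"))
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["ä½łçļĦ", "ãģ¡ãĤĥ", "Äĥn", "posts"]:
        print(repr(tok), "->", repr(decode_token(tok)))
```

Under this assumption, ä½łçļĦ decodes to 你的 (Chinese for "your"), which fits alongside "your" and "OP" among the positive logits. A token such as è¬ holds only part of a multi-byte character, so it cannot be decoded on its own.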