Explanations
humor and playful language
Negative Logits
–↵↵
-0.19
디시
-0.16
isContained
-0.15
네이트
-0.15
–
-0.15
:↵↵
-0.14
وذلك
-0.14
데이트
-0.14
–
-0.14
:↵↵↵
-0.14
Positive Logits
ppl
0.18
[=
0.17
...
0.16
...,
0.15
/thread
0.15
''
0.15
...)
0.15
_
0.15
Reply
0.15
thread
0.15
Activations Density 3.981%