INDEX
Explanations
phrases that indicate steps in a process or comparative assessments
Negative Logits
hoe      -0.16
Diy      -0.15
uç       -0.14
uat      -0.14
ÅĻ       -0.14
_vlog    -0.13
Unsafe   -0.13
arkan    -0.13
blr      -0.13
abet     -0.13
Positive Logits
again            0.17
straightforward  0.16
least            0.15
interesting      0.15
/archive         0.14
Again            0.14
another          0.14
most             0.14
controversial    0.14
Again            0.14
Activations Density 0.225%