INDEX
Explanations
phrases that introduce or highlight specific examples, lists, or references
Negative Logits
llum          -0.16
еÑĨÑĤ          -0.15
สล            -0.15
ftware        -0.15
hua           -0.15
atters        -0.14
Commentary    -0.14
ingles        -0.14
orum          -0.14
pornstar      -0.14

Positive Logits
example       0.23
exemple       0.20
example       0.19
examples      0.17
list          0.17
heads         0.17
Example       0.17
exemp         0.16
tip           0.16
hint          0.16
Activations Density 0.055%
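
The page does not define how the density figure is computed. A common reading is the fraction of sampled tokens on which the feature activates above zero, and the sketch below follows that assumption. The function name `activation_density`, the `threshold` parameter, and the sample values are all hypothetical and are not taken from the dashboard.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    feature fires at all. The source page does not confirm this definition.
    """
    if len(activations) == 0:
        raise ValueError("need at least one activation value")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Illustrative data only: 11 firing tokens out of 20,000 sampled tokens.
sample = [1.0] * 11 + [0.0] * 19989
print(format(activation_density(sample), ".3%"))  # prints 0.055%
```

Under this reading, a density of 0.055% would mean the feature fires on roughly 5 or 6 tokens out of every 10,000.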