INDEX
Explanations
special characters or symbols in the text
Negative Logits
“       -0.34
(“      -0.28
“[      -0.26
(       -0.26
“â̦     -0.25
“       -0.20
``      -0.20
"       -0.20
"(      -0.18
"`      -0.18
Positive Logits
-'          0.23
fucking     0.23
ourselves   0.21
fuck        0.20
-↵↵         0.20
–↵↵         0.20
-"          0.20
-.          0.19
-↵↵         0.19
our         0.19
Activations Density 0.003%