INDEX
Explanations
ambiguous or irrelevant content, as there is no consistent theme or pattern in the activations
specific phrases or references to entities associated with style or behavior in various contexts
Negative Logits
Token     Logit
<@        -0.70
gib       -0.68
Gmail     -0.66
decomp    -0.66
"+        -0.65
JPEG      -0.65
+++       -0.64
scrut     -0.61
fortun    -0.61
"<        -0.61
Positive Logits
Token     Logit
s         1.53
ski       1.10
scl       1.05
ses       1.02
ship      1.02
t         1.02
d         0.98
ved       0.97
tis       0.97
tal       0.95
Activations Density 0.338%
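
As a rough illustration of what a density figure like this could mean, the sketch below assumes that activation density is the fraction of sampled tokens on which the feature's activation is above zero. That definition, the function name `activation_density`, and the toy data are assumptions for illustration only; they do not come from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumed definition: a token counts as "active" when its activation
    is strictly greater than the threshold (0.0 by default).
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical toy data: 1000 tokens, 3 of which activate the feature.
toy = [0.0] * 997 + [0.4, 1.2, 2.5]
density = activation_density(toy)
print(f"{density:.3%}")
```

Under this assumed definition, a density of 0.338% would correspond to the feature firing on roughly 3.4 of every 1,000 sampled tokens.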