INDEX
Explanations
phrases related to causality or consequence
phrases that indicate significant emphasis or importance
Negative Logits
scatter      -0.74
scattering   -0.72
paternal     -0.66
stagger      -0.64
dirt         -0.64
prol         -0.64
eleph        -0.64
Annotations  -0.64
tremend      -0.63
cyan         -0.63
Positive Logits
¹            1.03
£            0.96
º            0.94
¢            0.89
¡            0.87
Ī            0.86
¬            0.85
į            0.85
ı            0.84
¼            0.84
Activations Density: 0.762%