Explanations
phrases indicating surprise or unexpected realizations
Negative Logits
alo        -0.14
lue        -0.13
nock       -0.13
lej        -0.13
eated      -0.13
ÑĮÑİ       -0.13
getti      -0.12
nul        -0.12
/single    -0.12
orang      -0.12
Positive Logits
thought        0.48
expected       0.43
thought        0.42
Thought        0.42
expected       0.37
Thought        0.36
anticipated    0.34
hoped          0.33
Expected       0.33
assumed        0.33
Activations Density: 0.362%
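
The page does not say how the density figure is computed. A minimal sketch, assuming it is the fraction of sampled tokens on which the feature's activation is above zero; the function name `activation_density`, the `threshold` parameter, and the sample values are all hypothetical and chosen only for illustration:

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature
    fires at all. The dashboard itself does not define the term.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical activations for one feature over 1000 tokens:
# 996 tokens where it is silent and 4 where it fires.
sample = [0.0] * 996 + [0.7, 1.2, 0.3, 2.1]

density = activation_density(sample)
print(f"{density:.3%}")  # formatted as a percentage, like the figure above
```

Under this reading, a density of 0.362% would mean the feature fires on roughly 3 to 4 tokens out of every 1000.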