Explanations
statements expressing confusion or uncertainty
Negative Logits
Token       Logit
alink       -0.16
znam        -0.16
åľ          -0.15
CAST        -0.15
åļ          -0.15
iske        -0.15
emean       -0.14
.ba         -0.14
contres     -0.14
.cg         -0.14
Positive Logits
Token       Logit
mystery     0.25
baff        0.25
wonder      0.25
puzz        0.23
inexp       0.21
ocities     0.20
mysterious  0.19
puzzled     0.19
puzzles     0.19
Wonder      0.19
Activations Density 0.175%