INDEX
Explanations
indefinite articles followed by various nouns
Negative Logits
faces       -0.16
unchecked   -0.16
ivos        -0.16
_dropout    -0.15
OrElse      -0.14
ponent      -0.13
bang        -0.13
िब          -0.13
cko         -0.13
fur         -0.13
Positive Logits
cue         0.20
look        0.19
oath        0.18
Cue         0.18
closer      0.17
stab        0.17
liking      0.17
aliz        0.17
det         0.16
dete        0.16
Activations Density: 0.027%