INDEX
Explanations
instances of demonstrative pronouns and adjectives
Negative Logits
s                 -0.19
er                -0.16
det               -0.16
orman             -0.16
uel               -0.16
ole               -0.15
ous               -0.15
r                 -0.15
ont               -0.15
arta              -0.15
Positive Logits
maal              0.17
że                0.17
ParameterValue    0.17
gre               0.16
ATRIX             0.16
ìłĢ               0.16
rale              0.16
avir              0.15
.openg            0.14
nect              0.14
Activations Density: 0.031%