INDEX
Explanations
the presence of specific characters or patterns in text
Negative Logits
Token       Logit
elif        -0.17
lish        -0.16
209         -0.15
ampie       -0.15
sis         -0.15
alc         -0.14
loating     -0.14
shadow      -0.14
Möglich     -0.14
Household   -0.14
Positive Logits
Token       Logit
ohn         0.23
agers       0.22
ese         0.21
age         0.21
AGER        0.20
izens       0.20
kw          0.19
agen        0.19
ern         0.19
andle       0.18
Activations Density: 0.009%