INDEX
Explanations
names of individuals or places
the word "are" in various contexts
Negative Logits
ingen      -0.80
omez       -0.76
ured       -0.72
insula     -0.70
obs        -0.69
urers      -0.69
ulates     -0.68
uration    -0.67
enegger    -0.66
elled      -0.65
Positive Logits
tto        1.15
tta        1.07
lli        1.03
nce        1.03
ndra       0.96
pta        0.95
tsky       0.94
zza        0.94
llan       0.93
nda        0.93
Activations Density 0.025%