INDEX
Explanations
words related to specific locations or tribes, likely the Tuareg tribe given the activations
the word "are" in different contexts
Negative Logits
ingen       -0.80
omez        -0.76
ured        -0.76
ulates      -0.74
uration     -0.73
ues         -0.72
enegger     -0.72
inatory     -0.71
isting      -0.69
inosaur     -0.68
Positive Logits
tto         1.12
tta         1.08
nce         1.01
lli         1.01
llan        0.97
nda         0.93
tsky        0.92
ndra        0.92
nces        0.90
zza         0.90
Activations Density: 0.018%