INDEX
Explanations
instances of specific letters or characters
Negative Logits
Amazon   -0.17
-se      -0.16
pa       -0.15
avia     -0.15
Cong     -0.15
perv     -0.15
kil      -0.15
ught     -0.14
undry    -0.14
iei      -0.14
Positive Logits
apos     0.21
god      0.19
adr      0.19
avr      0.18
elo      0.18
bir      0.18
grad     0.18
unan     0.18
nan      0.17
idar     0.17
Activations Density 0.002%