Explanations
mentions of the word "ali" with different activation levels
mentions of 'ali' or variations thereof
Negative Logits
manship     -0.84
acters      -0.84
ly          -0.79
IAL         -0.77
ilater      -0.77
lier        -0.76
olicy       -0.74
nect        -0.74
ilaterally  -0.74
liest       -0.73
Positive Logits
yah         1.17
ensis       0.99
Äĩ          0.98
ño          0.94
ña          0.85
ñ           0.84
qi          0.84
Lama        0.81
WAYS        0.81
ère         0.80
Activations Density: 0.029%