Explanations
"and" followed by pronouns or articles
Negative Logits
Token    Value
スの    0.48
იყოს    0.46
渶    0.45
特的    0.44
ouilles    0.44
小于    0.43
ری    0.43
твы    0.42
্রের    0.42
炵    0.41
Positive Logits
Token    Value
1    0.63
2    0.55
[no visible token]    0.55
!)    0.53
9    0.52
0    0.52
.    0.50
şağı    0.49
:}    0.49
қа    0.48
Activations Density: 0.038%