INDEX
Explanations
sentences discussing individuals and their experiences or roles
Negative Logits
less     -0.35
ılığı    -0.34
IAL      -0.33
uygun    -0.33
no       -0.31
وقد      -0.30
ノリ     -0.30
some     -0.29
üng      -0.29
did      -0.29
Positive Logits
every        0.82
每一次       0.80
every        0.80
Chwiliwch    0.79
Every        0.79
ſte          0.77
Every        0.76
Anſ          0.73
Мексичка     0.73
każ          0.71
Activations Density: 0.458%