INDEX
Explanations
frequent pronouns and articles in sentences
Negative Logits
s         -1.07
Witt      -0.86
ness      -0.81
ses       -0.79
Ej        -0.74
nya       -0.73
böz       -0.73
cí        -0.73
Assisi    -0.72
Haf       -0.72
Positive Logits
aDecoder   0.97
στη        0.93
صوتيه      0.88
Bue        0.86
detainees  0.84
Parke      0.84
τη         0.82
quoique    0.81
Marconi    0.81
ΤΗ         0.81
Activations Density 0.066%