Explanations
code punctuation and abbreviations
Negative Logits
a     0.84
a     0.79
t     0.76
in    0.76
at    0.71
ви    0.70
ა     0.70
(     0.65
фа    0.63
ü     0.63
Positive Logits
σε    0.62
و    0.60
،    0.60
ไม่    0.55
ډول    0.55
meski    0.54
περιο    0.53
man    0.53
siglo    0.53
सार्वजनिक    0.52
Activations Density: 0.337%