Explanations
lists or asks questions
high-frequency function words and structural/formatting tokens (e.g., articles, prepositions, modals, punctuation, and control/section markers).
Negative Logits
あくまで  0.29
subtlety  0.27
intégr  0.27
Anzahl  0.26
playmaker  0.26
červ  0.25
மொத்தம்  0.25
Mutations  0.25
wechsl  0.24
Holds  0.24
Positive Logits
риа  0.26
様専用  0.25
ларда  0.25
ῶν  0.24
MENTS  0.23
ेलर  0.23
ιου  0.23
اريات  0.23
다가  0.23
ικού  0.23
Activations Density: 0.556%