Explanations
phrases indicating actions, achievements, or obligations
Negative Logits
Usaha       -0.42
themſelves  -0.40
elcome      -0.40
itſelf      -0.38
both        -0.38
alſo        -0.38
keduanya    -0.37
অ           -0.36
căng        -0.36
gleiche     -0.36
Positive Logits
Only        0.94
Only        0.93
only        0.91
only        0.88
ONLY        0.84
ONLY        0.83
лишь        0.81
רק          0.71
только      0.70
Hanya       0.66
Activations Density: 0.040%