INDEX
Explanations
"can" followed by potential action
Negative Logits
ua        1.04
ور        0.96
드        0.95
Denne     0.92
ujuan     0.91
anciens   0.87
oma       0.85
ik        0.84
ANG       0.84
DM        0.83
Positive Logits
ס         1.16
ה         1.09
ه         0.98
an        0.96
া         0.95
ע         0.93
ان        0.91
การ       0.91
de        0.87
on        0.87
Activations Density 0.625%