Explanations
use affirmative or objective
Negative Logits
ي  0.71
盾  0.50
িকে  0.50
ꔰ  0.49
ת  0.48
yó  0.48
كتب  0.47
다  0.47
Honestly  0.46
をお  0.46
Positive Logits
Falcons  0.48
seres  0.46
avad  0.46
Gord  0.45
accharides  0.44
chats  0.44
극한  0.44
umed  0.43
ac  0.43
atians  0.43
Activations Density 0.000%