INDEX
Explanations
phrases suggesting the importance of not solely relying on the speaker's claims
Negative Logits
999      -0.06
no       -0.06
fol      -0.06
front    -0.05
Stam     -0.05
erer     -0.05
bard     -0.05
false    -0.05
ú        -0.05
rong     -0.05
POSITIVE LOGITS
alone    0.09
trust    0.09
alone    0.09
Trust    0.08
مانی     0.08
банку    0.08
Alone    0.07
trust    0.07
Trust    0.07
пов      0.07
Activations Density 0.004%