Explanations
affirmations or expressions of agreement
Negative Logits
tracted      -0.16
anca         -0.15
acie         -0.15
ãģ°ãģĭãĤĬ    -0.15
à¥ĩद         -0.14
umbo         -0.14
cı           -0.14
nowhere      -0.14
tn           -0.14
inciple      -0.14
Positive Logits
indeed       0.40
inde         0.33
enia         0.30
/no          0.30
sir          0.29
Virginia     0.29
Indeed       0.29
inde         0.28
Indeed       0.27
sire         0.24
Activations Density 0.050%
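
To make the density figure concrete, here is a minimal sketch that assumes "activations density" means the fraction of sampled token positions on which the feature has a nonzero activation. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and are not taken from the dashboard. Under that assumption, 0.050% corresponds to roughly one active token in every 2,000.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" is the share of sampled tokens on which the
    feature fires at all; the dashboard's exact definition may differ.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 1 of 2,000 token positions.
sample = [0.0] * 1999 + [3.2]
density = activation_density(sample)
print(f"{density:.3%}")
```

A density this low suggests a sparse feature, which would fit an explanation as narrow as "affirmations or expressions of agreement".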