Explanations
expressions of honesty and straightforwardness
Negative Logits
Token              Logit
=’                 -0.64
étend              -0.63
InvalidProtocol    -0.60
AndWait            -0.59
ană                -0.57
μέ                 -0.54
spéciaux           -0.54
GEBURTS            -0.54
льше               -0.53
artin              -0.52
Positive Logits
Token              Logit
Frankly            1.04
Honestly           1.02
honestly           1.00
frankly            0.98
Honestly           0.92
Tbh                0.89
tbh                0.89
honestly           0.83
说实话              0.83
admit              0.80
Activations Density: 0.129%