INDEX
Explanations
Phrases or words related to being direct, upfront, or uncomplicated.
Negative Logits
覚醒          -0.79
Lauder        -0.76
livest        -0.71
mble          -0.68
ĸļ            -0.67
Downloadha    -0.66
Ples          -0.66
theless       -0.65
7601          -0.64
mur           -0.64
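Some raw token strings in these lists, such as ĸļ, look garbled because, assuming the model uses a GPT-2-style byte-level BPE tokenizer (which these strings are consistent with), every byte is displayed as a printable stand-in character. A minimal sketch of the reverse mapping, rebuilt from the GPT-2 `bytes_to_unicode` table; `token_to_bytes` and `decode_token` are hypothetical helper names:

```python
def bytes_to_unicode() -> dict:
    """GPT-2 byte-level table: printable bytes map to themselves,
    the remaining bytes are shifted to code points 256 and up."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}

# Invert the table: displayed character -> original byte.
BYTE_DECODER = {ch: b for b, ch in bytes_to_unicode().items()}

def token_to_bytes(token: str) -> bytes:
    """Recover the raw bytes behind a displayed token string."""
    return bytes(BYTE_DECODER[ch] for ch in token)

def decode_token(token: str) -> str:
    """Decode as UTF-8; partial multi-byte characters become U+FFFD."""
    return token_to_bytes(token).decode("utf-8", errors="replace")

print(decode_token("è¦ļéĨĴ"))   # 覚醒
print(token_to_bytes("ĸļ"))     # b'\x96\x9a'
```

Under this assumption, the raw string è¦ļéĨĴ decodes to the Japanese word 覚醒, while ĸļ is only the two continuation bytes 0x96 0x9A, a fragment of a multi-byte character split across tokens, so it has no standalone text form.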
Positive Logits
ened          1.35
away          1.20
ening         1.19
eners         1.09
forward       1.00
ener          0.92
edge          0.90
line          0.89
bent          0.88
FIX           0.88
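Lists like the two above are commonly produced by projecting a feature's decoder direction through the model's unembedding matrix and keeping the highest- and lowest-scoring tokens. This dashboard does not state its method, so the sketch below is an assumption; `logit_effects`, `top_k`, and all numbers are hypothetical toy values, not this feature's weights:

```python
from typing import List, Sequence, Tuple

def logit_effects(decoder_dir: Sequence[float],
                  W_U: Sequence[Sequence[float]]) -> List[float]:
    """Score each vocabulary token by projecting the feature's decoder
    direction (length d_model) through W_U (d_model rows x vocab columns)."""
    d_model, vocab_size = len(W_U), len(W_U[0])
    assert len(decoder_dir) == d_model
    return [sum(decoder_dir[i] * W_U[i][t] for i in range(d_model))
            for t in range(vocab_size)]

def top_k(scores: Sequence[float], vocab: Sequence[str], k: int,
          largest: bool = True) -> List[Tuple[str, float]]:
    """Return the k tokens with the highest (or lowest) scores, sorted."""
    order = sorted(range(len(scores)), key=lambda t: scores[t], reverse=largest)
    return [(vocab[t], scores[t]) for t in order[:k]]

# Toy numbers for illustration only (not the real model's weights).
vocab = ["away", "forward", "Lauder", "mur"]
decoder_dir = [1.0, 0.5]
W_U = [[1.0, 0.6, -0.5, -0.2],
       [0.4, 0.8, -0.6, -0.8]]

scores = logit_effects(decoder_dir, W_U)
print(top_k(scores, vocab, k=2))                 # positive logits
print(top_k(scores, vocab, k=2, largest=False))  # negative logits
```

With these toy values, "away" and "forward" land in the positive list and "Lauder" and "mur" in the negative list; a real dashboard would run the same ranking over the full vocabulary.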
Activation Density 0.023%
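A density of 0.023% is 0.00023, or roughly 23 firing tokens per 100,000. The sketch below assumes density means the fraction of sampled tokens on which the feature's activation is above zero (the dashboard does not define it). `activation_density` and the sample data are hypothetical:

```python
from typing import Iterable

def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Fraction of tokens whose feature activation exceeds `threshold`."""
    acts = list(activations)
    if not acts:
        raise ValueError("need at least one activation")
    return sum(1 for a in acts if a > threshold) / len(acts)

# Hypothetical sample: 23 firing tokens out of 100,000.
sample = [0.0] * 100_000
for i in range(23):
    sample[i * 4_000] = 1.5

density = activation_density(sample)
print(f"{density:.3%}")  # prints "0.023%"
```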