Explanations
expressions of agreement and appreciation in discussions
Negative Logits
Token            Logit
хьтан            -0.57
ConstraintMaker  -0.53
argest           -0.52
arbitrarily      -0.50
otomatig         -0.48
مشين             -0.47
Designer         -0.47
timeter          -0.46
нгред            -0.46
DELAY            -0.46
Positive Logits
Token       Logit
truth       0.65
valid       0.64
truths      0.63
truth       0.59
insightful  0.54
Truth       0.52
agree       0.51
verdades    0.51
valid       0.51
真理        0.51
Activations Density 0.382%