Explanations
expressions of gratitude or thanks
Negative Logits
Token        Logit
Default      -0.67
APD          -0.64
estyles      -0.60
Pred         -0.58
farious      -0.57
outper       -0.57
behavi       -0.56
dominates    -0.56
pred         -0.56
alth         -0.55
Positive Logits
Token        Logit
welcome      0.79
!!!!!        0.79
gracious     0.76
sir          0.76
thank        0.72
kindly       0.72
goodbye      0.72
Thank        0.71
generous     0.70
hello        0.67
Activations Density: 0.021%
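
For readers unfamiliar with the density figure, the sketch below shows one common way such a number is computed. It assumes that density means the fraction of token positions on which the feature's activation is above zero. The function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and are not taken from this page or from any particular tool.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" counts a token as active when its activation is
    strictly greater than the threshold (0.0 by default).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 10,000 token positions, 2 of them active.
toy = [0.0] * 9998 + [1.3, 0.4]
print(f"{activation_density(toy):.3%}")  # prints 0.020%
```

Under this reading, a density of 0.021% corresponds to roughly 2 active tokens per 10,000 sampled.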