Explanations
mentions of praise and positive feedback
NEGATIVE LOGITS
Symbol      -0.15
Accepted    -0.14
wang        -0.14
ARGS        -0.14
Symbol      -0.14
ellig       -0.14
ina         -0.13
_WARNING    -0.13
trad        -0.13
اÙĦرÙħ      -0.13
POSITIVE LOGITS
praise           0.38
compliment       0.37
complement       0.35
praises          0.32
glowing          0.32
compliments      0.32
complimentary    0.30
praising         0.29
rave             0.29
comple           0.28
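
Token lists like the ones above are commonly produced by projecting a feature's decoder direction through the model's unembedding vectors and sorting the resulting per-token scores. The sketch below illustrates that idea under the assumption that this page follows the same convention, which the page itself does not state. The vocabulary, `decoder_direction`, and `unembedding_rows` are hypothetical toy values, not data from the model.

```python
def logit_effects(decoder_direction, unembedding_rows):
    """Dot the feature direction with each token's unembedding vector."""
    return [
        sum(d * w for d, w in zip(decoder_direction, row))
        for row in unembedding_rows
    ]


def top_tokens(vocab, effects, k):
    """Return the k most negative and k most positive (token, score) pairs."""
    ranked = sorted(zip(vocab, effects), key=lambda pair: pair[1])
    negative = ranked[:k]        # lowest scores first
    positive = ranked[::-1][:k]  # highest scores first
    return negative, positive


# Hypothetical toy data: a 2-dimensional residual stream and 4 tokens.
vocab = ["praise", "compliment", "Symbol", "ARGS"]
decoder_direction = [1.0, -0.5]
unembedding_rows = [
    [0.3, -0.2],   # praise
    [0.3, -0.1],   # compliment
    [-0.1, 0.1],   # Symbol
    [-0.1, 0.05],  # ARGS
]

effects = logit_effects(decoder_direction, unembedding_rows)
negative, positive = top_tokens(vocab, effects, k=2)
```

Here `positive` ranks "praise" ahead of "compliment" and `negative` ranks "Symbol" ahead of "ARGS", mirroring the ordering convention of the two lists above.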
Activations Density 0.243%
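
Activation density is usually read as the fraction of sampled tokens on which the feature fires at all. A minimal sketch assuming that definition (activation strictly above a threshold of zero); the `sample` list is hypothetical and only illustrates the arithmetic.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds the threshold."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 3 of 1000 tokens.
sample = [0.0] * 997 + [1.2, 0.4, 3.1]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.300%
```

Under this reading, a density of 0.243% would mean the feature is active on roughly 2 or 3 tokens out of every 1000 sampled.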