Explanations
positive phrases and statements
expressions of positive emotions and sentiments
Negative Logits
failed        -0.77
censored      -0.71
cens          -0.71
incompet      -0.69
Restrict      -0.69
antagonists   -0.68
offending     -0.67
delinquent    -0.67
redacted      -0.67
obscure       -0.67
Positive Logits
appreciated   0.97
compliment    0.97
welcome       0.97
compliments   0.97
gratitude     0.97
grateful      0.95
commend       0.95
delighted     0.95
positive      0.93
invaluable    0.93
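Lists like the two above are commonly produced by projecting a feature's decoder direction onto the model's unembedding matrix and keeping the tokens with the largest and smallest scores. The sketch below illustrates that ranking under that assumption only; the page does not say how its values were computed (they may be normalized differently), and the vocabulary, decoder direction, and unembedding values are made-up toy numbers, not data from this feature.

```python
# Hypothetical sketch: rank vocabulary tokens by a feature's direct logit effect.
# Assumption: the effect on token t is the dot product of the feature's decoder
# direction with column t of the unembedding matrix. All values are toy data.

def logit_effects(decoder_dir, unembed, vocab):
    """Return {token: effect}, where effect = sum_i decoder_dir[i] * unembed[i][t]."""
    effects = {}
    for t, token in enumerate(vocab):
        effects[token] = sum(d * unembed[i][t] for i, d in enumerate(decoder_dir))
    return effects

def top_k(effects, k, largest=True):
    """Return the k (token, effect) pairs with the largest (or smallest) effect."""
    return sorted(effects.items(), key=lambda kv: kv[1], reverse=largest)[:k]

vocab = ["grateful", "failed", "welcome", "censored"]
decoder_dir = [1.0, -0.5]           # toy 2-dimensional residual direction
unembed = [
    [0.9, -0.6, 0.8, -0.5],         # row i = residual dimension i, column t = token t
    [-0.1, 0.3, 0.0, 0.4],
]

effects = logit_effects(decoder_dir, unembed, vocab)
print("positive:", top_k(effects, 2, largest=True))
print("negative:", top_k(effects, 2, largest=False))
```

With these toy numbers "grateful" and "welcome" rank as the most promoted tokens and "failed" and "censored" as the most suppressed, which mirrors the shape of the real lists without reproducing their values.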
Activation Density: 1.647%
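Activation density is usually read as the fraction of token positions on which the feature fires. Under that reading, 1.647% means the feature is active on roughly 1.6 of every 100 tokens. The sketch below assumes "fires" means an activation strictly above zero (the `threshold` parameter is a hypothetical knob, not something the page specifies), and the activation values are made-up examples.

```python
# Hypothetical sketch: activation density as the share of token positions
# where the feature's activation exceeds a threshold (assumed 0.0 here).

def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation is above threshold."""
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)

# Toy data: 1,000 token positions, 20 of which activate the feature.
acts = [0.0] * 980 + [1.2] * 20
print(f"Activation Density: {activation_density(acts):.3%}")
```

Counting only strictly positive activations matches the usual ReLU-style SAE setup, where inactive features output exactly zero; a nonzero threshold would make the reported density lower.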