INDEX
Explanations
phrases with the word "no"
negations and phrases expressing refusal or prohibition
Negative Logits
Token       Logit
rn          -0.79
often       -0.76
typically   -0.73
turned      -0.72
mund        -0.72
then        -0.70
RAFT        -0.69
ellect      -0.69
nr          -0.67
olutely     -0.67
Positive Logits
Token         Logit
excuses       1.05
exceptions    1.00
refunds       0.95
xious         0.89
regrets       0.87
compulsion    0.85
surprises     0.83
compromises   0.83
tolerance     0.82
isy           0.82
Activations Density 0.099%