INDEX
Explanations
negative responses or refusals
negative affirmations or words expressing refusal
Negative Logits
RAFT      -0.76
lycer     -0.75
iership   -0.74
ulative   -0.66
romeda    -0.66
endish    -0.65
IUM       -0.64
assies    -0.64
ual       -0.64
rious     -0.63
Positive Logits
xious     1.12
zzle      0.99
matter    0.93
except    0.92
obs       0.88
longer    0.86
oses      0.86
ct        0.84
ises      0.83
AH        0.81
Activations Density 0.087%