Explanations
phrases expressing refusal or opposition
negations and expressions of refusal
Negative Logits
token              logit
nonetheless        -0.71
nevertheless       -0.70
ãĤ¼ (ゼ)           -0.69
ãĥ¼ãĥĨ (ーテ)     -0.66
unmist             -0.65
ãĥ¯ãĥ³ (ワン)     -0.64
invariably         -0.64
swiftly            -0.64
senal              -0.63
simultaneously     -0.62
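Some tokens in the list above (for example `ãĤ¼`) look like encoding debris, but they match how byte-level BPE vocabularies print raw bytes. The sketch below assumes the GPT-2 style byte-to-character table, which the page itself does not confirm, and inverts it to recover the underlying UTF-8 text. The names `byte_to_char_table` and `decode_token` are illustrative and not part of any dashboard API.

```python
def byte_to_char_table():
    """Rebuild the GPT-2 style byte -> printable-character table (assumed here)."""
    # Bytes that are already printable map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every remaining byte is assigned a code point starting at 256.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


# Inverse table: displayed character -> original byte value.
CHAR_TO_BYTE = {ch: b for b, ch in byte_to_char_table().items()}


def decode_token(token: str) -> str:
    """Map each displayed character back to its byte, then decode as UTF-8."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in token)
    # A token may hold only part of a multi-byte character, so do not fail hard.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["ãĤ¼", "ãĥ¼ãĥĨ", "ãĥ¯ãĥ³", "nonetheless"]:
        print(repr(tok), "->", repr(decode_token(tok)))
```

Under that assumption the three odd-looking entries decode to katakana fragments, while plain ASCII tokens pass through unchanged. A character outside the table raises `KeyError`, which is a useful signal that the token came from a different tokenizer scheme.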
Positive Logits
token              logit
hin                1.34
fuckin             1.14
wanna              1.11
deserve            1.09
fucking            1.00
gonna              0.95
belong             0.92
condone            0.92
exist              0.91
EVEN               0.90
Activations Density: 0.265%