INDEX
Explanations
expressions related to urgency or necessity
Negative Logits
adelphia        -0.86
creen           -0.75
wife            -0.75
ells            -0.74
alid            -0.73
claimer         -0.72
dor             -0.71
aters           -0.70
imore           -0.70
atever          -0.69
Positive Logits
attention       1.11
corrective      0.87
transparency    0.86
reinforcements  0.84
improvement     0.83
scrutiny        0.83
antidote        0.81
halt            0.80
overhaul        0.79
applause        0.78
Activations Density: 0.016%