INDEX
Explanations
negations or refusals
phrases emphasizing negation or the absence of something
Negative Logits
urous      -0.64
essen      -0.60
gaze       -0.59
tle        -0.59
aus        -0.58
irds       -0.57
eline      -0.57
heid       -0.57
ambition   -0.56
ilde       -0.55
Positive Logits
NOT        3.37
NOT        2.22
NEVER      1.98
ONLY       1.79
ALWAYS     1.70
ALSO       1.69
WITHOUT    1.51
THEN       1.50
VERY       1.50
REALLY     1.46
Activations Density: 0.010%