Explanations
phrases that express denial or contradiction
Negative Logits
Token         Logit
variation     -0.09
Vari          -0.09
Variation     -0.09
Vari          -0.08
itra          -0.08
variations    -0.08
ibold         -0.08
variants      -0.08
variation     -0.07
versions      -0.07
Positive Logits
Token         Logit
claim         0.08
intended      0.07
intent        0.07
intends       0.07
aim           0.07
meant         0.07
intend        0.07
intending     0.07
kins          0.06
zoekt         0.06
Activations Density 0.035%