INDEX
Explanations
phrases that indicate doubt or contradiction
Negative Logits
still
-0.17
Still
-0.16
still
-0.16
STILL
-0.15
плю
-0.15
_contin
-0.14
Still
-0.14
尚
-0.14
onto
-0.13
serialize
-0.13
Positive Logits
WRONG
0.40
Wrong
0.39
reality
0.36
Wrong
0.35
well
0.35
wrong
0.34
well
0.32
Reality
0.32
Well
0.32
WELL
0.30
Activations Density 0.220%