INDEX
Explanations
phrases indicating surprise or unexpectedness
Negative Logits
enderror      -0.15
incons        -0.14
IXEL          -0.13
Schwe         -0.13
aight         -0.13
_sensitive    -0.13
ewire         -0.13
اتÙĩ          -0.13
bsub          -0.13
_FAULT        -0.13
Positive Logits
surprise      0.88
surprises     0.77
Surprise      0.76
surprised     0.68
surpr         0.66
surprising    0.60
sur           0.59
Sur           0.57
-sur          0.56
unexpected    0.55
Activations Density: 0.301%