INDEX
Explanations
phrases indicating surprise or disbelief
phrases that express a lack of awareness or understanding
Negative Logits
rend        -0.80
ugal        -0.72
cipl        -0.70
unity       -0.66
eous        -0.66
srfAttach   -0.65
pora        -0.64
only        -0.64
ror         -0.62
illed       -0.61
Positive Logits
remotely    1.33
bothering   0.99
bothered    0.94
vaguely     0.93
halfway     0.90
bother      0.90
hint        0.85
faintly     0.84
close       0.84
scratch     0.83
Activations Density: 0.067%