INDEX
Explanations
phrases indicating a speaker's self-reference or directives
Negative Logits
ivet       -0.17
apult      -0.15
_hal       -0.15
ICENSE     -0.14
GOODMAN    -0.14
TEMPL      -0.14
ablo       -0.14
ÅĻiv       -0.14
568        -0.14
IGNAL      -0.13
Positive Logits
clearing    0.15
stip        0.15
_simps      0.15
ecta        0.14
confession  0.14
ultra       0.14
klar        0.14
Haz         0.14
ECT         0.14
Mis         0.14
Activations Density: 0.085%