Explanations
phrases indicating a sense of freshness or novelty
Negative Logits
Token     Logit
ist       -0.16
owns      -0.14
uce       -0.14
awan      -0.13
acak      -0.13
parts     -0.13
_procs    -0.13
fo        -0.13
McInt     -0.13
ord       -0.12
Positive Logits
Token       Logit
hle         0.17
elden       0.15
ikat        0.14
latex       0.14
.trailing   0.14
reib        0.14
uhn         0.14
ifu         0.14
dept        0.13
ahren       0.13
Activations Density: 0.087%