Explanations
headers or introductory markers indicating the beginning of a new section or topic
Negative Logits
              -0.66
supposedly    -0.66
<eos>         -0.63
clunky        -0.61
desperation   -0.59
UI            -0.57
clueless      -0.56
A             -0.56
supposed      -0.56
exactly       -0.55
Positive Logits
itſelf        1.24
myſelf        1.22
ſelf          1.19
pleaſure      1.12
ſelves        1.12
auffi         1.11
purpoſe       1.10
Efq           1.08
ſtate         1.05
cauſe         1.04
Activations Density 0.304%