Explanations
numerical structures such as lists or countdowns
the repeated appearance of a specific character or symbol
Negative Logits
adolesc      -0.66
suicide      -0.65
ignition     -0.64
Morse        -0.63
Spot         -0.63
Stuff        -0.62
Lancaster    -0.62
undown       -0.61
Antar        -0.61
Addiction    -0.60
Positive Logits
agree        1.02
own          0.98
ï¸ı          0.97
should       0.89
felt         0.87
mand         0.87
tarians      0.84
tu           0.84
selves       0.83
ould         0.83
Activations Density 0.161%