INDEX
Explanations
phrases that assert beliefs or clarify facts
Negative Logits
ſtate      -0.83
pleaſure   -0.81
cuffs      -0.78
purpoſe    -0.77
ſame       -0.76
RSSSF      -0.76
Portály    -0.75
Yoh        -0.75
raiſ       -0.75
Weyl       -0.75
Positive Logits
being      1.07
also       1.02
is         0.98
be         0.97
simply     0.84
was        0.84
not        0.83
going      0.82
être       0.80
becoming   0.78
Activations Density 0.321%