Explanations
phrases that encourage observation or reflection
Negative Logits
ernel     -0.16
hait      -0.15
arter     -0.14
лас       -0.14
pent      -0.14
prit      -0.14
_ghost    -0.13
akens     -0.13
itters    -0.13
iale      -0.13
Positive Logits
ascar         0.17
oda           0.15
312           0.14
оду           0.14
amac          0.14
Solar         0.14
IPH           0.14
expressive    0.13
sut           0.13
anye          0.13
Activations Density 0.070%