Explanations
instances of high activation, suggesting an emphasis on key points in discussions or texts
Negative Logits
atile     -0.18
oe        -0.16
ye        -0.16
ya        -0.15
ARNING    -0.15
Ay        -0.14
LATED     -0.14
pdu       -0.14
Äĩe       -0.14
yyy       -0.14
Positive Logits
boat      0.16
apus      0.16
ĥ         0.15
elocity   0.15
uzz       0.15
atego     0.15
urdy      0.14
ãĥ¼ãĥģ    0.14
ancode    0.14
egasus    0.14
Activations Density 0.199%