INDEX
Explanations
phrases that indicate impactful events or actions
Negative Logits
eme        -0.18
kim        -0.15
paren      -0.15
.advance   -0.14
_unref     -0.14
æħ         -0.14
ãĤ         -0.14
ical       -0.14
меÑī       -0.14
istar      -0.14
Positive Logits
jackpot    0.21
harder     0.20
hard       0.19
hit        0.17
Hard       0.17
pause      0.17
hitting    0.17
stride     0.16
targets    0.16
hardest    0.16
Activations Density: 0.036%