INDEX
Explanations
ending sentences with a specific word
Negative Logits
touted        0.36
supportive    0.33
twor          0.33
competes      0.32
support       0.31
späteren      0.31
athletics     0.30
nort          0.29
hardware      0.29
support       0.29
Positive Logits
DEATH         0.36
𝗢             0.35
바로            0.34
ANDO          0.33
!”.           0.32
časti         0.32
사람            0.32
Ი             0.32
கொடு          0.32
종             0.31
Activations Density: 0.002%