INDEX
Explanations
even followed by harmful actions
Negative Logits
Token      Logit
appunto    0.45
anzi       0.43
it         0.43
сцю        0.42
ziehungs   0.41
🄴          0.41
thereby    0.41
ellos      0.40
ственно    0.40
fashioned  0.39
Positive Logits
Token  Logit
即使     0.60
حتی    0.57
Even   0.57
Even   0.56
даже   0.56
EVEN   0.54
even   0.52
حتى    0.52
就算     0.52
even   0.51
Activations Density 0.052%