INDEX
Explanations
hunger strikes and suicides
Negative Logits
бе          0.40
צרים        0.40
essentially 0.39
במה         0.38
выход       0.38
🚪          0.38
𝗅           0.37
closed      0.37
ería        0.37
snp         0.37
Positive Logits
self         0.53
burns        0.49
imm          0.47
inm          0.46
Self         0.46
burn         0.46
self         0.46
焼           0.44
Selbst       0.44
temperatures 0.43
Activations Density: 0.004%