INDEX
Explanations
safety reasons for prohibitions
Negative Logits
tokens      0.44
Tokens      0.42
Boul        0.41
Tomas       0.40
トーク      0.39
tokens      0.39
抟          0.37
Token       0.37
一条        0.36
গুলোকে      0.36
Positive Logits
Aside       0.43
理由        0.42
amatsu      0.40
threefold   0.40
unido       0.39
aside       0.39
ngunit      0.38
கிறது       0.38
lakini      0.38
INCLUDING   0.37
Activations Density 0.089%