Explanations
harmful content and instructions
Negative Logits
provides     0.32
aids         0.32
專業         0.29
aided        0.28
Deluxe       0.28
hopefully    0.27
Made         0.27
&#           0.27
Provides     0.27
Gaff         0.26
Positive Logits
полити       0.31
defeated     0.31
proble       0.30
gestire      0.30
ുമോ          0.30
统治         0.30
sbagli       0.30
politique    0.30
accusation   0.30
kuhusu       0.30
Activations Density 0.001%