Explanations
pose a threat/risk/violation
Negative Logits
黴  0.62
ዝ  0.58
高峰  0.57
주의  0.56
cautious  0.56
ow  0.56
افظ  0.56
ah  0.55
ాలంటే  0.55
Junction  0.54
Positive Logits
teeth  0.80
existential  0.69
tooth  0.69
Trojan  0.68
ɬ  0.67
Transform  0.67
Teeth  0.66
attack  0.66
dientes  0.66
camada  0.65
Activations Density: 0.147%