INDEX
Explanations
deeply harmful or dangerous
Negative Logits
能够  0.41
能  0.40
ையுடன்  0.39
అందించ  0.37
但也  0.37
สามารถ  0.36
有时候  0.36
但  0.36
包含  0.36
included  0.35
Positive Logits
horrific  0.74
dreadful  0.73
disgraceful  0.72
inept  0.71
appalling  0.70
hopelessly  0.70
unacceptable  0.69
horrendous  0.69
dismal  0.69
disgusting  0.68
Activations Density 0.558%