INDEX
Explanations
No Explanations Found
Negative Logits
marav  0.48
强大  0.43
supremo  0.43
Faced  0.43
slaughtered  0.42
豪華  0.42
понадоби  0.41
spared  0.40
:)  0.40
ಚೆ  0.40
Positive Logits
harmful  1.73
unacceptable  1.55
problematic  1.52
distressing  1.51
disturbing  1.50
troubling  1.50
damaging  1.44
detrimental  1.44
unhealthy  1.34
unsettling  1.33
Activations Density: 0.965%