INDEX
Explanations
harmful illegal unethical wrong
Negative Logits
Easier        0.41
pubescens     0.36
होला          0.35
playlists     0.34
విస్త          0.34
充実          0.34
capabilities  0.33
handy         0.33
ẵn            0.33
templates     0.33
Positive Logits
unacceptable   1.49
unethical      1.45
inappropriate  1.26
wrong          1.16
problematic    1.12
immoral        1.11
objectionable  1.09
unwise         1.07
inappropri     1.06
undesirable    1.04
Activations Density 0.098%