INDEX
Explanations
harmful or undesirable content
Negative Logits
blindness     0.46
boredom       0.41
elegance      0.40
possibili     0.40
realism       0.38
NoError       0.37
stupidity     0.37
subtleties    0.37
eliness       0.37
heroism       0.36
Positive Logits
harmful          1.67
problematic      1.55
undesirable      1.29
hazardous        1.27
unsafe           1.26
detrimental      1.26
inappropriate    1.25
risky            1.23
dangerous        1.21
destructive      1.21
Activations Density: 0.603%