INDEX
Explanations
sexual exploitation and violence
Negative Logits
unethical        0.68
inappropriate    0.61
questionable     0.61
unsustainable    0.57
dubious          0.56
hasty            0.55
unhealthy        0.54
improper         0.53
misleading       0.53
unfair           0.53
Positive Logits
hearing          0.82
seeing           0.79
hearing          0.77
Hearing          0.72
Seeing           0.67
seeing           0.66
Hearing          0.65
imagining        0.61
Seeing           0.61
να               0.61
Activations Density 0.026%