Explanations
unwanted sexual or harassing behavior
Negative Logits
masterpiece      0.54
plug             0.52
killer           0.51
trillions        0.50
doom             0.49
optimized        0.49
dynamically      0.48
civilizations    0.46
optimization     0.46
evils            0.46
Positive Logits
uncomfortable    0.82
harassing        0.80
harassment       0.80
intimidation     0.78
inappropriate    0.77
humiliating      0.74
escalating       0.72
conductas        0.72
comportamenti    0.71
discomfort       0.70
Activations Density 0.034%
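
The density figure is usually read as the share of token positions on which the feature fires at all. The page does not define it, so the sketch below is an assumption rather than the dashboard's actual computation: the function name `activation_density`, the `threshold` parameter, and the toy data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumed definition: the source page reports a density but does not say
    how it is computed.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 10,000 token positions, 3 of them active.
toy = [0.0] * 9997 + [1.2, 0.4, 2.5]
density = activation_density(toy)
print(f"{density:.3%}")
```

Under this reading, a density of 0.034% would mean the feature is active on roughly 3 to 4 of every 10,000 tokens sampled.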