INDEX
Explanations
mentions of ethical matters or breaches
references to ethics concepts and discussions
Negative Logits
nces        -0.80
upt         -0.79
Clockwork   -0.69
xual        -0.69
down        -0.68
Ingram      -0.67
noon        -0.67
ept         -0.67
Jub         -0.67
eworld      -0.66
Positive Logits
onomic      1.11
onom        0.88
dile        0.83
ostics      0.82
ethical     0.80
violations  0.76
watchdog    0.74
hazard      0.73
disclosure  0.73
breaches    0.73
Activations Density: 0.029%