INDEX
Explanations
words related to warnings or alerts, especially those related to potential negative consequences
references to "triggers" in various contexts
Negative Logits
ensable   -0.77
hemat     -0.75
apest     -0.74
nian      -0.73
apolis    -0.73
Cutler    -0.70
cott      -0.70
ately     -0.69
ijk       -0.66
esan      -0.66
Positive Logits
trigger      0.96
triggering   0.96
warnings     0.93
triggers     0.90
trigger      0.81
Trigger      0.76
triggered    0.74
witz         0.73
Warn         0.73
alerts       0.73
Activations Density 0.050%