INDEX
Explanations
words related to warnings or commands that evoke specific reactions
terms related to triggers and their significance in various contexts
Negative Logits
apest        -0.83
ensable      -0.75
hemat        -0.73
esan         -0.73
egal         -0.72
ographies    -0.71
jri          -0.70
apolis       -0.70
ately        -0.68
atography    -0.68
Positive Logits
warnings     1.02
triggers     0.96
triggering   0.94
trigger      0.91
trigger      0.87
alerts       0.82
happy        0.81
warning      0.79
triggered    0.78
alert        0.77
Activations Density 0.038%