INDEX
Explanations
phrases related to actions causing negative outcomes or harm
instances of causation leading to negative outcomes
Negative Logits
Token         Logit
Horizons      -0.61
itar          -0.61
demos         -0.60
gaard         -0.60
follow        -0.59
oped          -0.59
Transition    -0.58
text          -0.58
itarian       -0.57
interviews    -0.57
Positive Logits
Token         Logit
causing       3.32
inflicting    1.97
preventing    1.84
harming       1.78
injuring      1.77
affecting     1.73
disrupting    1.72
ruining       1.71
triggering    1.71
provoking     1.70
Activations Density: 0.025%