INDEX
Explanations
phrases related to incidents of violence and danger
Negative Logits
smiles      -0.14
smile       -0.14
Tomorrow    -0.13
ologue      -0.13
Sender      -0.13
umpt        -0.13
ètre        -0.13
smiling     -0.13
.eng        -0.13
_iff        -0.13
Positive Logits
witnessing      0.21
heard           0.21
dash            0.20
heard           0.20
intervene       0.19
intervened      0.19
intervention    0.18
hero            0.18
hearing         0.18
interven        0.18
Activations Density 0.130%