INDEX
Explanations
phrases related to preventing or stopping someone or something from taking a specific action
references to preventing or stopping actions and behaviors
Negative Logits
é¾įåĸļ士          -0.88
=-=-              -0.68
rette             -0.64
entin             -0.64
soDeliveryDate    -0.64
estyles           -0.63
BAT               -0.62
ETHOD             -0.60
Boone             -0.59
Extras            -0.59
Positive Logits
from              1.08
from              0.94
happening         0.92
FROM              0.91
harming           0.91
slipping          0.85
bleeding          0.82
altogether        0.82
accessing         0.81
From              0.80
Activations Density: 0.158%