INDEX
Explanations
phrases related to safety
Negative Logits
orno      -0.78
yi        -0.76
iry       -0.74
agents    -0.69
betrayal  -0.68
amy       -0.67
crime     -0.66
oras      -0.65
plates    -0.65
lins      -0.64
Positive Logits
exting        0.97
safely        0.96
conclud       0.95
outweigh      0.81
evacuated     0.80
ufact         0.79
veland        0.76
ãĤ©           0.75
transitioned  0.75
detonated     0.75
Activations Density 0.012%