INDEX
Explanations
language related to safety and security
phrases related to safety concerns
Negative Logits
dx          -0.88
issance     -0.85
eric        -0.78
eta         -0.78
igs         -0.75
sth         -0.75
naire       -0.74
sonian      -0.69
yss         -0.66
iguous      -0.64
Positive Logits
safety      1.20
ailability  1.00
safety      0.95
saf         0.87
Þ           0.85
practition  0.84
Safety      0.81
Safety      0.80
condem      0.80
ingred      0.75
Activations Density: 0.021%