INDEX
Explanations
phrases related to protection or security
terms related to protection and safety measures
Negative Logits
lins          -0.76
gered         -0.73
NetMessage    -0.70
bender        -0.70
ergy          -0.68
chrome        -0.66
ctory         -0.65
hl            -0.64
kell          -0.64
clay          -0.64
Positive Logits
safeguards    1.01
safegu        0.96
safeguard     0.93
protecting    0.90
saf           0.83
guarding      0.82
Protect       0.81
raints        0.81
shielding     0.80
protects      0.80
Activations Density: 0.018%