INDEX
Explanations
words related to safety and security
phrases emphasizing the concept of safety
Negative Logits
amy            -0.73
iery           -0.70
betrayal       -0.69
willingness    -0.65
newsletters    -0.64
yi             -0.63
enf            -0.63
ilion          -0.63
directions     -0.62
Killer         -0.61
Positive Logits
conclud         0.88
exting          0.80
dispose         0.80
mint            0.79
transitioned    0.77
ufact           0.76
evacuated       0.72
ãĤ©             0.72
aver            0.70
evacuate        0.69
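
The positive-logit token ãĤ© above looks like encoding debris, but it is consistent with how byte-level BPE vocabularies print multi-byte characters. The sketch below assumes a GPT-2-style byte-to-character table, which this page does not confirm; the function names are illustrative and not part of any library.

```python
def bytes_to_unicode():
    """Build the byte -> printable-character table used by GPT-2-style byte-level BPE."""
    # Bytes that are already printable map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every remaining byte is assigned the next code point starting at 256.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


# Inverse table: displayed character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_vocab_token(token):
    """Turn a vocabulary string back into the text it stands for (assumes UTF-8 bytes)."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


print(decode_vocab_token("ãĤ©"))       # the three bytes E3 82 A9
print(decode_vocab_token("evacuate"))  # plain ASCII is unchanged
```

Under that assumption, ãĤ© stands for the bytes E3 82 A9, which is the UTF-8 encoding of the katakana character ォ (U+30A9), while ASCII tokens such as evacuate decode to themselves.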
Activations Density 0.026%
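
The page does not say how the density figure is computed. A common reading, assumed here, is the fraction of sampled tokens on which the feature has a nonzero activation. The sketch below uses made-up toy activations purely to show the arithmetic; `activation_density` is a hypothetical helper.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds the threshold (assumed definition)."""
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data, not taken from the page: 26 active tokens out of 100,000.
toy_activations = [0.0] * 99_974 + [1.5] * 26
density = activation_density(toy_activations)
print(f"{density:.3%}")
```

With this toy input, 26 active tokens out of 100,000 print as 0.026%, which gives a sense of how sparse a feature at that density would be.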