Explanations
terms related to safety in various contexts
Negative Logits
Token                 Logit
ocol                  -0.17
427                   -0.16
æĪ¶                   -0.15
_RB                   -0.15
éis                   -0.15
eyen                  -0.15
_HT                   -0.15
ApplicationContext    -0.14
sto                   -0.14
û                     -0.14
Positive Logits
Token                 Logit
/security             0.16
-minded               0.16
andre                 0.16
ron                   0.15
ÏĥÏĦα                 0.15
tainment              0.15
(fake                 0.14
ãĥ³ãĥĩ                0.14
Bureau                0.14
iliar                 0.14
Activations Density 0.020%