INDEX
Explanations
phrases related to danger or potential harm
Negative Logits
iao              -0.16
agues            -0.16
ares             -0.15
arez             -0.15
.sharedInstance  -0.15
ero              -0.15
ignon            -0.14
amin             -0.14
-called          -0.14
ÐŁÐļ             -0.14
Positive Logits
danger           0.17
dangers          0.17
lessly           0.17
çĬ¶              0.17
baar             0.16
elm              0.16
weigh            0.16
jsp              0.15
ĺ                0.14
ources           0.14
Activations Density 0.034%