INDEX
Explanations
sentences indicating potential dangers or warnings
Negative Logits
.)↵↵↵↵      -0.14
ï¼į         -0.14
ÂŃ          -0.14
ï¿¥         -0.14
wat         -0.14
adil        -0.13
ayload      -0.13
hon         -0.13
maj         -0.13
_invoke     -0.13
Positive Logits
,           0.21
handjob     0.16
,↵          0.16
ÑĢд         0.15
illi        0.15
she         0.15
And         0.15
And         0.14
ØĮ          0.14
unde        0.14
Activations Density 0.074%