INDEX
Explanations
incidents involving violence or injury
Negative Logits
asc         -0.16
ouch        -0.16
oÄį         -0.15
uncomment   -0.14
Uncomment   -0.14
itti        -0.14
ascar       -0.14
ograd       -0.14
andy        -0.14
ascript     -0.14
Positive Logits
ihu         0.18
uka         0.15
errat       0.15
linkplain   0.15
elib        0.14
ting        0.14
iasi        0.14
ilis        0.14
elu         0.13
emale       0.13
Activations Density: 0.413%