INDEX
Explanations
references to harm or injury
Negative Logits
خانه         -0.16
ropa         -0.16
IFI          -0.15
irie         -0.15
swith        -0.15
sWith        -0.14
bil          -0.14
bie          -0.14
bies         -0.14
arehouse     -0.14
Positive Logits
done         0.23
aceutical    0.18
ois          0.17
repair       0.17
ola          0.17
DONE         0.17
Done         0.17
ะ            0.16
Done         0.16
proof        0.16
Activations Density 0.023%
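
The page does not say how the density figure is computed. A common reading, assumed here, is the share of sampled token positions on which the feature fires at all. The sketch below illustrates that reading only: the function name `activation_density`, the zero threshold, and the sample of 23 active positions out of 100,000 are all hypothetical and chosen just to reproduce a value of the same size.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of positions whose activation exceeds `threshold`.

    Assumption: "density" means the fraction of sampled tokens on which the
    feature is active; the source page does not define it.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: 23 active positions out of 100,000 tokens.
sample = [0.0] * 99_977 + [1.5] * 23
density = activation_density(sample)
print(f"{density:.3f}%")
```

Under this reading, a density of 0.023% would mean the feature is active on roughly 1 in every 4,300 tokens, which is typical of a sparse, narrowly scoped feature.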