INDEX
Explanations
concepts related to harm and injury
Negative Logits
addock    -0.15
Äįi       -0.15
AGMA      -0.15
abbo      -0.15
ضÙĦ       -0.14
تز        -0.14
appy      -0.14
Ùĥر       -0.14
elpers    -0.13
ourg      -0.13
Positive Logits
wake      0.34
wake      0.30
trail     0.27
Wake      0.27
Wake      0.25
wakes     0.25
Trail     0.22
Trail     0.22
trail     0.20
path      0.20
Activations Density 0.080%