Explanations
references to harm or damage in various contexts
Negative Logits
enty          -0.16
rick          -0.16
lify          -0.15
Nİ            -0.15
klady         -0.15
shire         -0.14
.nz           -0.14
yı            -0.14
bject         -0.14
../../../../  -0.14
Positive Logits
sustained     0.34
done          0.33
Done          0.28
DONE          0.27
done          0.27
Done          0.26
sustain       0.24
-done         0.24
done          0.22
sust          0.21
Activations Density: 0.080%