Explanations
phrases indicating serious offenses or allegations
Negative Logits
Token      Logit
ÃŃch       -0.16
uv         -0.15
429        -0.14
ToProps    -0.14
icine      -0.14
ë§Ľ        -0.14
enheim     -0.14
.bam       -0.14
à¹Ĥล       -0.14
Smy        -0.13
Positive Logits
Token        Logit
serious      0.86
Serious      0.72
serious      0.70
seriousness  0.68
seriously    0.61
-ser         0.60
grave        0.55
severe       0.55
seri         0.53
Ser          0.52
Activations Density: 0.070%