INDEX
Explanations
instances of rule-breaking or violations
Negative Logits
Token         Logit
00200000      -0.71
vice          -0.69
ãĥ¼ãĥĨãĤ£     -0.66
tesque        -0.65
hate          -0.65
Hell          -0.64
Bank          -0.64
mega          -0.64
question      -0.63
Investor      -0.62
Positive Logits
Token          Logit
curfew         0.74
fins           0.72
vaccinations   0.71
performance    0.71
liberties      0.70
tranqu         0.70
transmissions  0.69
immersion      0.69
vaccination    0.69
orius          0.68
Activations Density 0.416%