INDEX
Explanations
terms related to legal and social issues involving actions and behaviors
phrases related to harassment and assault
Negative Logits
[/            -0.64
ortium        -0.62
izen          -0.59
*/(           -0.58
ynes          -0.57
ozy           -0.57
Lanka         -0.56
Reviewer      -0.55
]).           -0.55
actionDate    -0.54
Positive Logits
their         1.40
their         1.23
theirs        1.18
THEIR         1.13
Their         1.10
they          1.05
they          1.01
They          0.99
THEY          0.96
Their         0.93
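The two lists above rank tokens by how strongly the feature pushes their logits down or up. A minimal sketch of that ranking step follows. It assumes a per-token "logit effect" vector is already available (for example, the feature's decoder direction projected through the model's unembedding; this is an assumption, not something the dashboard states). The function name `top_logits` is hypothetical, and the toy values are partly copied from the lists above, with the `the` entry invented.

```python
# Hypothetical sketch: rank vocabulary tokens by a feature's logit effect.
# Assumption: logit_effect[i] is the feature's effect on token i's logit.
# Toy data: values for their/they/Lanka/ortium come from the lists above;
# the "the" entry is invented for illustration.

def top_logits(vocab: list[str], logit_effect: list[float], k: int):
    """Return (most negative k, most positive k) as (token, value) pairs."""
    pairs = list(zip(vocab, logit_effect))
    ranked = sorted(pairs, key=lambda p: p[1])  # ascending by effect
    negative = ranked[:k]                        # smallest (most negative) values
    positive = ranked[::-1][:k]                  # largest (most positive) values
    return negative, positive


if __name__ == "__main__":
    vocab = ["their", "they", "Lanka", "ortium", "the"]
    effect = [1.40, 1.05, -0.57, -0.62, 0.02]
    neg, pos = top_logits(vocab, effect, k=2)
    print("Negative:", neg)  # [('ortium', -0.62), ('Lanka', -0.57)]
    print("Positive:", pos)  # [('their', 1.4), ('they', 1.05)]
```

A dashboard would run the same ranking over the full vocabulary with k = 10, which matches the length of each list shown here.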
Activations Density 1.777%
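One common reading of "activation density" is the percentage of tokens on which the feature's activation is strictly positive. The sketch below computes that quantity. The definition, the function name `activation_density`, and the toy activations are all assumptions for illustration. Under this reading, 1.777% means the feature fires on roughly 18 of every 1,000 tokens.

```python
# Hypothetical sketch: activation density as the percentage of tokens whose
# feature activation is strictly positive. The definition is an assumption
# about how the dashboard figure is computed.

def activation_density(activations: list[float]) -> float:
    """Return the percentage of tokens with activation > 0."""
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > 0)
    return 100.0 * active / len(activations)


if __name__ == "__main__":
    acts = [0.0, 0.0, 3.2, 0.0] * 25  # 100 toy tokens, 25 of them active
    print(f"Activations Density {activation_density(acts):.3f}%")  # 25.000%
```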