Explanations
instances of physical violence and abuse, including terms like "beaten," "raped," "burned," and "assaulted."
instances of physical abuse or violence
Negative Logits
hang         -0.70
ARE          -0.69
istries      -0.66
formation    -0.65
FK           -0.65
alities      -0.64
allows       -0.63
tions        -0.63
Zone         -0.63
rium         -0.62
Positive Logits
by               0.88
merciless        0.88
aback            0.85
ĸļ               0.80
unfairly         0.78
hostage          0.72
nikov            0.71
Sapphire         0.71
unnecessarily    0.71
inappropriately  0.70
Activations Density 0.194%
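
As a rough illustration of what a density figure like this usually means, the sketch below computes the fraction of token positions on which a feature's activation is above zero. The function name `activation_density`, the zero threshold, and the sample data are all assumptions made for illustration; the page itself does not state how its 0.194% value was computed.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled token positions on
    which the feature fires at all. The source page does not confirm this.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 100,000 token positions, 194 of which activate the feature.
sample = [0.0] * 99_806 + [1.5] * 194
density = activation_density(sample)
print(f"{density:.3%}")
```

With this made-up sample the feature fires on 194 of 100,000 positions, which is the same order of sparsity as the figure reported above.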