Explanations
information related to allegations of abuse and misconduct
references to allegations or claims of abuse or exploitation
Negative Logits
Token        Logit
Alright      -0.71
Tycoon       -0.70
worthiness   -0.69
})           -0.69
TBD          -0.68
resides      -0.66
Balance      -0.64
focus        -0.63
optimize     -0.63
bye          -0.63
Positive Logits
Token           Logit
harassed         1.00
witnessing       0.93
raped            0.92
mist             0.90
bullied          0.90
discriminated    0.89
coerced          0.88
humiliated       0.85
grop             0.84
subjected        0.83
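The two lists above rank tokens by how strongly this feature pushes their output logits down or up. A common way SAE dashboards produce such lists is to multiply the feature's decoder direction by the model's unembedding matrix and keep the extremes. That this page computed its scores this way is an assumption, since the page does not say. Below is a minimal NumPy sketch in which `decoder_dir`, `W_U`, `vocab`, and `top_logit_effects` are hypothetical names:

```python
import numpy as np

def top_logit_effects(decoder_dir, W_U, vocab, k=10):
    """Hypothetical helper: project a feature's decoder direction through the
    unembedding and return the k most positive and k most negative tokens.

    Assumes decoder_dir has shape (d_model,), W_U has shape (d_model, vocab_size),
    and vocab maps token index -> token string.
    """
    logits = decoder_dir @ W_U          # one score per vocabulary token
    order = np.argsort(logits)          # ascending: most negative first
    neg = [(vocab[i], float(logits[i])) for i in order[:k]]
    pos = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return pos, neg
```

Sorting the full score vector once and reading both ends gives the positive and negative lists from a single pass.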
Activation Density: 0.396%
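Activation density is the share of token positions on which the feature fires. A density of 0.396% works out to roughly 4 active tokens per 1,000. Below is a minimal sketch. It assumes that "fires" means an activation strictly above a threshold of 0, because the page does not state the exact threshold. The `activation_density` helper is hypothetical:

```python
import numpy as np

def activation_density(acts, threshold=0.0):
    """Hypothetical helper: fraction of token positions whose feature
    activation is strictly greater than `threshold` (assumed 0 by default)."""
    acts = np.asarray(acts, dtype=float)
    return float(np.mean(acts > threshold))  # mean of a boolean mask = fraction True
```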