Explanations
words related to negative behavior or actions, specifically focusing on harassment
instances of the word "harassment" in various contexts
Negative Logits
Token       Logit
rians       -0.76
ACTED       -0.76
archs       -0.71
essential   -0.71
éĹĺ         -0.71
obb         -0.71
arch        -0.70
rich        -0.69
ramid       -0.68
stanbul     -0.67
Positive Logits
Token       Logit
harass      1.07
harassment  1.02
harassing   0.92
harassed    0.91
stalking    0.84
accus       0.78
assment     0.78
tactics     0.73
lords       0.73
complaints  0.72
Activations Density 0.017%
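
As a rough illustration of what an activation density figure like the one above could mean, the sketch below assumes that density is the fraction of tokens on which the feature's activation is above zero. That definition, the function name `activation_density`, and the toy data are all assumptions made for illustration; they are not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the
    feature fires at all; the page itself does not define the term.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical toy data: 10,000 tokens, the feature fires on 2 of them.
toy = [0.0] * 10_000
toy[123] = 3.2
toy[4567] = 0.8

print(f"{activation_density(toy):.3%}")  # 0.020%
```

Under that assumed definition, a density of 0.017% would correspond to the feature firing on roughly 1.7 out of every 10,000 tokens.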