INDEX
Explanations
phrases related to negative portrayals or descriptions involving individuals
references to characters and their roles in narratives, particularly villains and societal issues
Negative Logits
Token        Logit
Companies    -0.69
atars        -0.68
OY           -0.65
notations    -0.65
accordingly  -0.65
Production   -0.65
cu           -0.65
videos       -0.64
Wire         -0.64
Tickets      -0.64
Positive Logits
Token        Logit
sorts        0.97
nowhere      0.95
paradise     0.86
disguise     0.85
whom         0.83
whose        0.82
steroids     0.82
contrasts    0.79
nutshell     0.74
exile        0.73
Activations Density: 0.295%