INDEX
Explanations
terms related to controversial and inflammatory language, including slurs and provocative statements
terms related to controversial social and political identities
Negative Logits
ispers       -0.84
rams         -0.80
suites       -0.80
izens        -0.75
Us           -0.74
Shots        -0.73
rils         -0.73
patches      -0.73
timelines    -0.72
Lans         -0.72
Positive Logits
whore         0.82
prostitute    0.79
unto          0.79
digy          0.79
breaker       0.77
himself       0.74
believer      0.74
pretending    0.73
atical        0.73
nik           0.72
Activations Density: 0.274%