INDEX
Explanations
threatening language and confrontational interactions
Negative Logits
broadly        -0.86
markedly       -0.85
strikingly     -0.83
xtap           -0.80
outset         -0.77
Historically   -0.76
reliance       -0.75
bolstered      -0.74
policymakers   -0.74
principally    -0.72
Positive Logits
fuckin         1.65
fucking        1.52
shit           1.50
gonna          1.40
bitch          1.39
crap           1.35
fucked         1.35
fuck           1.32
shitty         1.31
haha           1.30
Activations Density: 11.023%