INDEX
Explanations
highly belligerent and confrontational language
terms related to conflicts or warfare
Negative Logits
Tammy     -0.64
chops     -0.64
trophies  -0.61
bills     -0.61
Lake      -0.59
Mozilla   -0.59
Lake      -0.59
hoped     -0.58
hugs      -0.57
sm        -0.57
Positive Logits
erent     4.91
eren      1.24
erential  1.23
arent     1.11
erence    1.07
iliar     1.05
arant     1.04
ividual   1.02
minist    1.00
erest     0.98
Activations Density: 0.013%