INDEX
Explanations
phrases related to aggressive actions or criticisms directed towards something or someone
expressions related to aggressive confrontations and disputes
Negative Logits
artifacts        -0.74
OTA              -0.71
ylan             -0.68
Wonders          -0.68
Suc              -0.66
soDeliveryDate   -0.66
hess             -0.65
scope            -0.65
orderly          -0.63
sterdam          -0.63
Positive Logits
accusing         1.20
insults          1.08
slurs            1.06
leveled          1.03
slander          1.03
tir              1.00
denouncing       0.99
criticizing      0.98
accusation       0.97
dispar           0.97
Activations Density: 0.249%
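
The page does not define how the density figure is computed. A minimal sketch, assuming it is the fraction of sampled tokens on which the feature's activation is above zero; the function name `activation_density`, the `threshold` parameter, and the sample values are hypothetical and used only for illustration:

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of activations strictly above `threshold`.

    Assumption: "density" means the share of sampled tokens on which
    the feature fires at all; the source page does not confirm this.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical per-token activations for one feature (illustrative values only).
sample = [0.0, 0.0, 3.2, 0.0, 0.0, 0.0, 0.0, 1.1, 0.0, 0.0]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.249% would correspond to the feature firing on roughly 1 in 400 sampled tokens.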