Explanations
references to military incidents and their potential consequences
Negative Logits
Token    Logit
τέ       -0.16
ute      -0.15
roy      -0.15
Sugar    -0.15
661      -0.14
eded     -0.14
ROY      -0.14
inals    -0.14
utan     -0.14
mole     -0.14
Positive Logits
Token         Logit
peace         0.26
peace         0.24
Peace         0.23
peaceful      0.22
Peace         0.22
war           0.18
-war          0.17
peacefully    0.17
核            0.15
往            0.15
Activations Density 0.190%