INDEX
Explanations
mentions of different wars or war-related terms
occurrences of the word "wars"
Negative Logits
ATURE      -0.65
AUT        -0.63
STER       -0.63
SOURCE     -0.63
YL         -0.60
gow        -0.60
Dialogue   -0.58
Accuracy   -0.57
Asset      -0.57
nosis      -0.57
Positive Logits
hip        1.33
hips       1.30
pace       1.08
poons      0.95
uits       0.94
hops       0.94
pread      0.94
cale       0.92
mith       0.87
pite       0.85
Activations Density: 0.029%