Explanations
numbers, dates, and measurements within the text
Negative Logits
Token      Logit
nin        -0.18
80         -0.18
72         -0.17
Nin        -0.15
sevent     -0.15
eighty     -0.15
87         -0.15
78         -0.15
79         -0.15
92         -0.15
Positive Logits
Token      Logit
194        0.83
۱۹۴        0.56
195        0.54
۱۹۵        0.40
193        0.37
wartime    0.34
WWII       0.31
fasc       0.26
war        0.24
fascism    0.23
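Feature dashboards of this kind often derive the positive and negative logit lists by projecting the feature's decoder direction through the model's unembedding matrix and ranking vocabulary tokens by the result. This page does not state its method, so the following is a minimal sketch under that assumption. The function `top_logit_tokens`, its parameters `decoder_vec` and `unembed`, and the weights in the usage example are hypothetical toy values, not the dashboard's actual data.

```python
import numpy as np

def top_logit_tokens(decoder_vec, unembed, vocab, k=10):
    """Rank tokens by the direct effect a feature's decoder direction has on their logits.

    Assumption: decoder_vec has shape (d_model,) and is the feature's decoder row;
    unembed has shape (d_model, vocab_size) and is the model's unembedding matrix.
    """
    logits = decoder_vec @ unembed   # one logit contribution per vocabulary token
    order = np.argsort(logits)       # indices sorted from most negative to most positive
    negative = [(vocab[i], float(logits[i])) for i in order[:k]]
    positive = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return positive, negative

# Toy usage with made-up weights (d_model = 2, vocabulary of 5 tokens).
vocab = ["194", "195", "war", "nin", "80"]
unembed = np.array([[0.8, 0.5, 0.2, -0.2, -0.1],
                    [0.3, 0.1, 0.9, 0.4, 0.0]])
decoder_vec = np.array([1.0, 0.0])
positive, negative = top_logit_tokens(decoder_vec, unembed, vocab, k=2)
print(positive)  # [('194', 0.8), ('195', 0.5)]
print(negative)  # [('nin', -0.2), ('80', -0.1)]
```

Because this ranking uses only the decoder direction and the unembedding, it captures the feature's direct effect on next-token logits and ignores any effect routed through later layers.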
Activation Density: 0.050%
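Activation density is typically the fraction of token positions on which the feature's activation is above zero. That definition is an assumption here, since the page does not spell it out. The sketch below computes it with a hypothetical `activation_density` helper whose `threshold` parameter defaults to 0.

```python
import numpy as np

def activation_density(activations, threshold=0.0):
    """Fraction of token positions where the feature fires (activation > threshold).

    Assumption: 'fires' means strictly greater than the threshold, default 0.
    """
    acts = np.asarray(activations)
    return float((acts > threshold).mean())

# Toy usage: one firing position out of 2,000 tokens.
acts = np.zeros(2000)
acts[0] = 1.3
density = activation_density(acts)
print(f"{density * 100:.3f}%")  # 0.050%
```

Under this definition, a density of 0.050% means the feature fires on roughly 1 of every 2,000 tokens.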