INDEX
Explanations
historical references and terminology related to political movements or ideologies
Negative Logits
190         -0.28
189         -0.24
191         -0.23
187         -0.23
188         -0.23
ctors       -0.20
192         -0.19
telegram    -0.19
186         -0.19
Alfred      -0.17
Positive Logits
176              0.41
174              0.36
178              0.35
177              0.35
173              0.35
175              0.33
172              0.31
179              0.31
Enlightenment    0.30
171              0.27
Activations Density 0.152%