INDEX
Explanations
country-related words and political or governmental references
Negative Logits
earchers    -0.87
utenberg    -0.81
aurus       -0.80
earch       -0.77
equality    -0.76
ories       -0.76
etting      -0.76
ilion       -0.75
ecause      -0.75
mith        -0.73
Positive Logits
hood        1.28
extraord    0.97
dom         0.95
doms        0.94
esses       0.88
who         0.81
gery        0.78
ess         0.78
ry          0.77
ishly       0.75
Activations Density: 1.528%