Explanations
terms related to Western culture and influence
Negative Logits
TO          -0.17
stown       -0.16
ocab        -0.16
plete       -0.15
amma        -0.14
efon        -0.14
FromClass   -0.14
uld         -0.14
odore       -0.14
utr         -0.14
Positive Logits
ern          0.35
ward         0.31
ERN          0.30
ers          0.30
erner        0.28
most         0.28
s            0.26
eners        0.24
/right       0.23
ern          0.23
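Token lists like these are commonly produced by projecting a feature's output (decoder) direction through the model's unembedding matrix and reading off the lowest- and highest-scoring tokens. The sketch below assumes that procedure; `direction`, `unembed`, `vocab`, and `top_bottom_logits` are hypothetical names, not values or functions from this dashboard.

```python
def top_bottom_logits(direction, unembed, vocab, k=10):
    """Score every vocab token by the dot product of a feature direction
    with the unembedding matrix, then return the k lowest and k highest.

    direction: list of d_model floats (assumed feature decoder row)
    unembed:   d_model rows, each a list of len(vocab) floats (assumed W_U)
    """
    d_model = len(direction)
    # Logit contribution of the feature to each token j.
    scores = [
        sum(direction[i] * unembed[i][j] for i in range(d_model))
        for j in range(len(vocab))
    ]
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    negative = ranked[:k]            # most negative scores first
    positive = ranked[::-1][:k]      # most positive scores first
    return negative, positive


neg, pos = top_bottom_logits(
    direction=[1.0, 0.0],
    unembed=[[0.5, -0.2, 0.1], [9.0, 9.0, 9.0]],
    vocab=["a", "b", "c"],
    k=1,
)
print(neg, pos)
```

Because the ranking is a single sort, both lists come from the same scores, which matches how each list here is ordered from the most extreme value inward.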
Activation density: 0.038%
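Activation density is usually reported as the percentage of sampled tokens on which the feature fires, so 0.038% corresponds to roughly 4 tokens in every 10,000. A minimal sketch, assuming "fires" means an activation strictly above a threshold of 0; the function name and threshold are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of tokens whose feature activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    fired = sum(1 for a in activations if a > threshold)
    return 100.0 * fired / len(activations)


density = activation_density([0.0, 0.0, 1.2, 0.0])
print(f"{density}%")
```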