INDEX
Explanations
names of organizations
proper nouns, particularly names and organizations
Negative Logits
orate      -0.79
izons      -0.71
arding     -0.70
urate      -0.69
arded      -0.68
iard       -0.67
acting     -0.65
oard       -0.63
raising    -0.62
uminati    -0.62
Positive Logits
plings     0.82
atchewan   0.77
eways      0.77
ustain     0.76
ority      0.73
utra       0.73
earcher    0.73
arin       0.73
Rough      0.72
ETH        0.71
Activations Density 0.169%