INDEX
Explanations
specific words that refer to actions or objects, particularly nouns
conditional phrases suggesting dependence or consequence
Negative Logits
AMA       -0.54
Medals    -0.54
contrad   -0.54
Caucus    -0.52
rhet      -0.52
Haas      -0.52
Hung      -0.51
disav     -0.51
ethn      -0.50
—         -0.50
Positive Logits
pires         0.82
accompanies   0.79
caster        0.68
itiveness     0.67
older         0.67
ivalry        0.64
pired         0.64
wegian        0.63
Fast          0.63
OULD          0.62
Activations Density: 0.942%