INDEX
Explanations
phrases related to comparisons or distinctions between different entities
Negative Logits
oldemort     -0.66
ario         -0.64
irez         -0.63
ysc          -0.63
idation      -0.62
nery         -0.62
ober         -0.61
ossession    -0.61
adal         -0.61
ipation      -0.61
Positive Logits
st           0.91
Īè           0.86
IJ           0.85
Ĭ±           0.84
ĪĴ           0.83
those        0.82
stad         0.77
peers        0.75
ī            0.75
them         0.72
Activations Density 0.414%
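
The density figure above can be read as the share of sampled tokens on which the feature fires. The page does not state how it was computed, so the following is only a minimal sketch under that assumption: the function name `activation_density`, the `threshold` parameter, and the sample values are all hypothetical and are not taken from the dashboard.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens with a nonzero
    (strictly greater than `threshold`) feature activation.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


# Hypothetical sample: 1,000 tokens, 4 of which activate the feature.
sample = [0.0] * 996 + [0.7, 1.2, 0.3, 2.5]
print(f"{activation_density(sample):.3%}")  # prints 0.400%
```

Under this reading, a density of 0.414% would correspond to the feature firing on roughly 4 of every 1,000 tokens in the sample.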