INDEX
Explanations
mentions of being at the top
references to "top" positions or rankings
Negative Logits
gm          -0.71
ellow       -0.66
ija         -0.64
fw          -0.63
ouri        -0.62
itta        -0.61
selves      -0.60
ewitness    -0.60
Parenthood  -0.58
fiance      -0.58
Positive Logits
most        1.06
level       0.86
thereof     0.82
liest       0.76
mast        0.75
tier        0.75
side        0.75
end         0.74
loader      0.73
of          0.72
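Lists like the negative and positive logits above are commonly produced by projecting the feature's decoder direction onto each vocabulary token's unembedding vector and sorting the results. The sketch below assumes that method; it is not confirmed as how these specific numbers were computed. The function names `logit_effects` and `top_and_bottom` and the toy two-dimensional vectors are hypothetical, and the sketch ignores any final normalization a real model may apply before unembedding.

```python
# Hypothetical sketch: rank vocabulary tokens by a feature's direct logit effect,
# assuming effect = dot(decoder direction, token unembedding vector).
from typing import Dict, List, Tuple


def logit_effects(decoder_dir: List[float],
                  unembed: Dict[str, List[float]]) -> Dict[str, float]:
    """Dot the feature's decoder direction with each token's unembedding vector."""
    return {tok: sum(d * u for d, u in zip(decoder_dir, vec))
            for tok, vec in unembed.items()}


def top_and_bottom(effects: Dict[str, float],
                   k: int) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most positive and the k most negative (token, effect) pairs."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])  # ascending by effect
    bottom = ranked[:k]          # most negative first
    top = ranked[::-1][:k]       # most positive first
    return top, bottom


if __name__ == "__main__":
    # Toy vectors chosen by hand for illustration; not real model weights.
    decoder = [1.0, 0.5]
    vocab = {"most": [2.0, 0.0], "tier": [0.0, 2.0], "gm": [-1.0, 0.0]}
    top, bottom = top_and_bottom(logit_effects(decoder, vocab), k=2)
    print(top)     # [('most', 2.0), ('tier', 1.0)]
    print(bottom)  # [('gm', -1.0), ('tier', 1.0)]
```

With a real vocabulary of tens of thousands of tokens, the same sort yields the top-10 and bottom-10 lists; only the size of `unembed` changes.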
Activations Density 0.047%
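Activation density is usually read as the percentage of tokens on which the feature fires. Under that reading, 0.047% means roughly 47 firing tokens per 100,000. The sketch below assumes "fires" means an activation strictly above a threshold of zero. The function name `activation_density` and the `threshold` parameter are hypothetical.

```python
# Hypothetical sketch: activation density as the percentage of tokens whose
# feature activation exceeds a threshold (assumed to be 0 by default).
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the percentage of activations strictly greater than `threshold`."""
    if len(activations) == 0:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)


if __name__ == "__main__":
    # Toy data: one firing token out of 2,000.
    acts = [0.0] * 1999 + [3.2]
    print(activation_density(acts))  # 0.05
```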