INDEX
Explanations
phrases that emphasize similarity or equivalence
Negative Logits
Token        Logit
vernment     -0.91
heit         -0.71
schild       -0.70
ール         -0.69
numbered     -0.68
Supported    -0.66
netflix      -0.65
Interested   -0.65
ァ           -0.64
senal        -0.63
Positive Logits
Token        Logit
goes         0.86
applies      0.81
holds        0.71
occurs       0.69
happens      0.68
cannot       0.66
assumes      0.65
stuff        0.64
intuition    0.64
accum        0.64
Activations Density: 0.014%