INDEX
Explanations
phrases that indicate comparisons or analogies
Negative Logits
Token    Logit
#$       -0.66
Posts    -0.66
aldo     -0.65
rous     -0.65
ãģł      -0.64
OUS      -0.63
itton    -0.62
regon    -0.59
ouk      -0.59
OUR      -0.59
Positive Logits
Token        Logit
pired        1.21
pires        1.20
portrayed    0.97
depicted     0.95
ociated      0.95
pects        0.93
well         0.93
phy          0.91
semb         0.89
pire         0.88
Activations Density: 0.084%