INDEX
Explanations
entities or names in a text-based setting
Negative Logits
hyde      -0.93
weight    -0.90
axe       -0.89
agne      -0.87
ijn       -0.85
Fernand   -0.83
gran      -0.82
mson      -0.81
pared     -0.81
Kant      -0.79
Positive Logits
IOR          1.17
idia         1.10
ANC          1.09
RL           1.09
ARA          1.02
ERC          1.00
CLA          0.99
ITE          0.98
IZ           0.98
vironment    0.97
Activations Density 0.148%