Explanations
Instances of comparison and contrast between different subjects or concepts
Negative Logits
ough        -0.17
yen         -0.16
оÑĥ         -0.16
ated        -0.15
elry        -0.15
ereotype    -0.15
etur        -0.15
uth         -0.15
ration      -0.14
oper        -0.14
Positive Logits
favor       0.17
isons       0.17
favor       0.17
ãģ¹         0.17
atively     0.17
against     0.16
apples      0.16
unfavor     0.16
contrast    0.16
Against     0.16
Activations Density 0.030%