INDEX
Explanations
phrases related to comparisons or lists of items
phrases that indicate comparison or similarity
Negative Logits
Token        Logit
ople         -0.83
itiveness    -0.70
/+           -0.67
trap         -0.63
rous         -0.62
======       -0.61
atism        -0.60
\<           -0.59
aca          -0.59
orage        -0.58
Positive Logits
Token        Logit
well         1.75
well         1.39
opposed      1.13
pects        0.99
ynchron      0.95
part         0.95
ociated      0.90
ides         0.88
diverse      0.88
Well         0.87
Activations Density 0.121%
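
As a rough illustration of what a density figure like the one above could mean, the sketch below treats activation density as the fraction of token positions on which the feature fires. That definition is an assumption, not something this page states; the function name `activation_density`, the `threshold` parameter, and the toy data are all hypothetical.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature is active. The dashboard may define it differently.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 10,000 token positions, 12 of them active.
toy_activations = [0.0] * 9988 + [1.5] * 12
density = activation_density(toy_activations)
print(f"{density:.3%}")
```

Under this reading, a density near 0.1% would mean the feature is active on roughly one token position in a thousand.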