Explanations
terms and phrases that indicate distinctions or comparisons between entities or concepts
Negative Logits
Token          Logit
Coolidge       -0.88
PhysRev        -0.86
Manpower       -0.85
Sagittarius    -0.82
Wallis         -0.81
Ayer           -0.81
Mehdi          -0.79
headers        -0.79
Camel          -0.79
Rosal          -0.78
Positive Logits
Token            Logit
distinguish      1.31
distingu         1.24
Dist             1.24
distingu         1.20
DIST             1.20
distingue        1.19
disting          1.19
DIST             1.18
dist             1.18
distinguishes    1.14
Activations Density: 0.122%