INDEX
Explanations
instances of differentiation and distinctions among concepts or categories
Negative Logits
earnest          -0.68
excess           -0.65
understatement   -0.62
ccoli            -0.58
upid             -0.56
underrated       -0.56
omission         -0.55
fin              -0.55
unc              -0.54
absence          -0.54
Positive Logits
altogether       1.02
than             0.95
depending        0.95
than             0.92
iates            0.88
iating           0.87
iations          0.85
different        0.83
Different        0.81
styles           0.78
Activations Density: 0.749%