INDEX
Explanations
phrases related to comparisons or distinctions between different categories or concepts
conjunctions and words indicating contrast or alternatives
Negative Logits
Carbuncle     -0.76
ngth          -0.75
Tears         -0.75
ttes          -0.74
amines        -0.74
chairs        -0.72
stanbul       -0.70
ulia          -0.69
oldemort      -0.68
Hew           -0.66
Positive Logits
otherwise     1.12
nons          1.03
unex          0.99
non           0.95
uns           0.93
passive       0.92
unprotected   0.91
nont          0.90
unin          0.89
conventional  0.88
Activations Density: 0.120%