INDEX
Explanations
the word "one" with varying strengths of activation for different contexts
references to the concept of "one" or singular items
Negative Logits
osponsors    -1.14
rations      -0.93
pees         -0.79
lations      -0.79
apons        -0.78
ooks         -0.78
etz          -0.78
ourses       -0.77
endars       -0.77
uts          -0.76
Positive Logits
thing        1.28
caveat       1.12
glaring      1.12
overarching  1.09
overriding   1.05
pecul        1.01
undeniable   1.00
drawback     0.98
exception    0.96
notable      0.93
Activations Density: 0.092%