INDEX
Explanations
phrases indicating similarity or sameness
repeated references to the concept of sameness
Negative Logits
ases        -0.80
*=-         -0.76
åĪ          -0.71
uria        -0.67
gets        -0.67
rosso       -0.67
airs        -0.66
HI          -0.65
rection     -0.65
Provided    -0.65
Positive Logits
thing       0.96
exact       0.86
ol          0.76
amount      0.71
ballpark    0.71
iating      0.70
everywhere  0.70
worldly     0.69
kind        0.67
playbook    0.67
Activations Density: 0.036%