Explanations
phrases that contrast different sides or perspectives on a topic
Negative Logits
theless      -0.67
zon          -0.61
RELEASE      -0.60
Effective    -0.60
Progress     -0.60
cit          -0.60
Updated      -0.59
burg         -0.57
zan          -0.56
anned        -0.55
Positive Logits
side          1.22
worldly       1.20
hemisphere    0.93
hand          0.92
half          0.89
most          0.88
iest          0.83
Side          0.83
end           0.83
halves        0.81
Activation Density: 0.629%
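Activation density is commonly read as the share of token positions on which a feature fires. The sketch below computes that percentage from a list of per-token activations. It assumes "fires" means an activation strictly greater than a threshold of 0.0. The function name `activation_density`, the `threshold` parameter, and the sample activations are hypothetical illustrations, not taken from this page or from any particular tool.

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Return the percentage of token positions whose activation exceeds `threshold`.

    Assumption (hypothetical): a feature counts as "active" on a token when its
    activation is strictly greater than `threshold` (0.0 by default).
    """
    values = list(activations)
    if not values:
        raise ValueError("need at least one activation value")
    # Count token positions where the feature fires.
    active = sum(1 for a in values if a > threshold)
    # Express the count as a percentage of all positions.
    return 100.0 * active / len(values)


# Hypothetical activations over 8 token positions: 2 are active, so density is 25.0%.
print(activation_density([0.0, 1.3, 0.0, 0.0, 0.4, 0.0, 0.0, 0.0]))  # 25.0
```

Under this reading, a density of 0.629% means the feature is active on roughly 6 of every 1,000 tokens.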