INDEX
Explanations
repetitive phrases indicating similarity or comparison
expressions and phrases that indicate repetition or similarity
Negative Logits
front      -0.75
dash       -0.72
ãĥĥ        -0.71
rend       -0.70
their      -0.70
Helpful    -0.68
rection    -0.68
ENDED      -0.68
replace    -0.68
orse       -0.67
Positive Logits
applies      1.33
goes         1.22
thing        1.15
happens      1.07
holds        1.01
principle    0.99
cannot       0.99
principles   0.97
happened     0.96
fate         0.93
Activations Density: 0.043%