INDEX
Explanations
phrases indicating a recommendation to watch something
references to watching videos or content
Negative Logits
ctrl              -0.98
phi               -0.72
VEN               -0.69
ãĤ¨ãĥ«            -0.69
ãĥ´               -0.68
interstitial      -0.68
sembly            -0.67
ascal             -0.66
cffffcc           -0.66
misunderstanding  -0.65
Positive Logits
tower             1.26
Watching          1.13
dog               1.04
dogs              0.98
Watch             0.85
Watch             0.84
Dogs              0.84
ing               0.82
watch             0.81
WATCH             0.80
Activations Density: 0.021%