INDEX
Explanations
the word "But" as the common thread among the activations
the word "But" and its repeated use to introduce contrasting statements or ideas
Negative Logits
¯¯¯¯          -0.63
ãģ®           -0.61
heads         -0.60
fell          -0.60
().           -0.59
segment       -0.58
.","          -0.57
pointer       -0.56
built         -0.55
bound         -0.54
Positive Logits
tons          1.24
chers         0.88
alas          0.86
theless       0.85
withstanding  0.82
romeda        0.81
owsky         0.77
ts            0.76
LER           0.76
tif           0.75
Activations Density: 0.088%
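
The density figure can be read as the share of sampled tokens on which the feature fires at all. The sketch below illustrates that reading only: the page does not state its definition, so treating "active" as "activation above zero" is an assumption, `activation_density` is a hypothetical helper, and the sample counts are made up to reproduce the same percentage.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: a token counts as "active" when its activation is
    strictly greater than the threshold (zero by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Made-up sample: 88 active tokens out of 100,000 sampled tokens.
sample = [1.5] * 88 + [0.0] * (100_000 - 88)
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.088%
```

Under this reading, a density of 0.088% means the feature is silent on roughly 999 of every 1,000 tokens.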