INDEX
Explanations
references to potential actions or consequences
Negative Logits
Chal       -0.68
Xuan       -0.66
Writing    -0.58
Scand      -0.57
building   -0.57
Ready      -0.56
Scor       -0.55
Writing    -0.55
Vis        -0.54
Kag        -0.54
Positive Logits
be            1.08
ideally       0.94
doubtless     0.92
undoubtedly   0.92
imply         0.92
suffice       0.91
allow         0.90
likely        0.89
eliminate     0.89
surely        0.89
Activations Density: 0.196%