Explanations
No Explanations Found
Negative Logits
↵    -0.96
↵↵↵    -0.39
<eos>    -0.39
(blank)    -0.35
↵↵↵↵    -0.34
(blank)    -0.34
↵↵↵↵↵    -0.32
↵↵↵↵↵↵    -0.32
(blank)    -0.32
(blank)    -0.32
Positive Logits
myſelf    1.18
himſelf    1.01
itſelf    0.98
purpoſe    0.98
pleaſure    0.95
NUMX    0.94
themſelves    0.94
whoſe    0.92
)";    0.91
Efq    0.91
Activations Density: 0.000%