INDEX
Explanations
segments related to death or severe criminal actions
Negative Logits
\{\\        -0.61
<eos>       -0.55
or          -0.48
George      -0.47
internet    -0.47
Lu          -0.47
Fns         -0.47
↵↵          -0.46
STAND       -0.46
↵↵↵         -0.45
Positive Logits
pleaſure    0.85
purpoſe     0.75
ſind        0.75
myſelf      0.74
faſt        0.74
ſtate       0.74
iſt         0.73
Anſ         0.73
reaſon      0.72
cauſe       0.71
Activations Density 0.230%