INDEX
Explanations
short phrases indicating importance, decision-making, and perspective
instances of placeholder text or empty signals
Negative Logits
Instr      -0.70
Clarkson   -0.69
Hert       -0.65
Oval       -0.65
Mobil      -0.64
Borders    -0.63
Berk       -0.62
Ninth      -0.62
Wr         -0.61
Front      -0.61
Positive Logits
][         1.01
))         0.88
)]         0.87
)))        0.87
_          0.84
));        0.82
gpu        0.81
):         0.80
);         0.80
]          0.80
Activations Density: 0.168%