INDEX
Explanations
phrases that express strong emotion or emphasis
symbols or characters that denote a specific concept or idea
Negative Logits
dock        -0.74
oxide       -0.69
bipolar     -0.63
inspector   -0.62
antip       -0.61
iciency     -0.61
manned      -0.59
sacked      -0.58
reper       -0.58
bonded      -0.58
Positive Logits
wait        1.04
etc         1.01
WHERE       1.01
yet         0.99
there       0.99
well        0.98
they        0.95
literally   0.94
until       0.93
nothing     0.92
Activations Density: 0.038%