INDEX
Explanations
phrases or concepts that are fundamentally important or flawed
phrases that emphasize foundational aspects or differences
Negative Logits
Token           Logit
ries            -0.75
Commissioners   -0.69
Chronicle       -0.68
tein            -0.67
supervisors     -0.65
Purchase        -0.64
Supervisor      -0.64
Guys            -0.63
runners         -0.63
Maid            -0.62
Positive Logits
Token           Logit
differentiated  0.86
00007           0.80
altering        0.77
restruct        0.76
wedd            0.74
disrupting      0.73
gebra           0.73
housed          0.72
embodied        0.71
ascript         0.71
Activations Density 0.011%