INDEX
Explanations
adjectives that express evaluation or judgment
repetitive or placeholder content without specific descriptive elements
Negative Logits
Hanson       -0.63
Barrett      -0.61
Ó            -0.61
ILA          -0.59
Vaugh        -0.59
sanctioned   -0.58
Hert         -0.58
Levine       -0.56
Thornton     -0.56
Mobil        -0.54
Positive Logits
_            0.75
][           0.72
!!           0.68
!            0.67
)]           0.67
enough       0.65
];           0.64
enough       0.64
cookie       0.64
]            0.64
Activations Density: 0.195%