INDEX
Explanations
phrases indicating understanding, justification, or explanation
phrases indicating comprehension or reasonableness of actions
Negative Logits
etry      -0.73
esh       -0.70
eki       -0.69
pee       -0.67
eng       -0.66
infect    -0.66
ngth      -0.65
andon     -0.63
resh      -0.62
cler      -0.62
Positive Logits
Mellon            0.83
indignation       0.80
DragonMagazine    0.75
FontSize          0.74
NPR               0.74
cffffcc           0.72
understandable    0.70
outrage           0.68
>>\               0.68
¶                 0.67
Activations Density: 0.048%