INDEX
Explanations
commands or instructions starting with "First,"
introductory phrases or transitions in text
Negative Logits
abled     -0.75
driving   -0.72
rams      -0.72
oslav     -0.71
nes       -0.70
ildo      -0.68
bd        -0.67
adv       -0.66
aden      -0.66
lain      -0.66
Positive Logits
congratulations  0.89
let              0.88
introdu          0.87
congr            0.78
apologize        0.71
Introduction     0.70
lets             0.70
apologies        0.68
FIX              0.67
suppose          0.66
Activations Density 0.080%