INDEX
Explanations
expressions of apology or excuses
Negative Logits
Token     Logit
odel      -0.17
ald       -0.16
inger     -0.15
dots      -0.14
ÂŃ        -0.14
Nam       -0.14
.dot      -0.14
-dot      -0.14
linger    -0.14
Nam       -0.13
Positive Logits
Token     Logit
fcn       0.19
CAUSED    0.17
aukee     0.15
lue       0.14
Jvm       0.14
oldt      0.14
&view     0.14
Dress     0.14
late      0.14
validate  0.14
Activations Density 0.036%