INDEX
Explanations
phrases indicating attribution or assigning responsibility
phrases that indicate causation or blame
Negative Logits
Token           Logit
Dur             -0.82
esson           -0.67
doesnt          -0.62
Advertisement   -0.61
ById            -0.60
Means           -0.60
nets            -0.59
didnt           -0.58
belts           -0.57
SourceFile      -0.57
Positive Logits
Token           Logit
blame           1.55
asted           1.43
asting          1.34
ying            1.27
wered           1.25
lled            1.24
iling           1.16
pless           1.13
gg              1.13
iled            1.10
Activations Density: 0.081%