Explanations
occurrences of non-informative content
Negative Logits
Token     Logit
ban       -0.65
scha      -0.65
-         -0.63
Te        -0.61
bu        -0.58
points    -0.55
,         -0.54
te        -0.54
ver       -0.54
ad        -0.54
Positive Logits
Token     Logit
Jefus     1.31
myſelf    1.28
itſelf    1.25
Anſ       1.25
Houſe     1.21
auffi     1.19
himſelf   1.18
pleaſure  1.15
ſelf      1.14
ſeveral   1.14
Activations Density 0.257%