INDEX
Explanations
the word "all" with a high level of activation
Negative Logits
IDS          -0.66
SHIP         -0.63
bledon       -0.62
Kamp         -0.62
sofar        -0.61
Caption      -0.61
KH           -0.60
Provision    -0.59
plin         -0.58
abwe         -0.57
Positive Logits
igator       1.29
ocating      1.23
usion        1.13
igators      1.12
usions       1.05
uring        1.04
usive        1.00
iance        1.00
iances       0.99
edged        0.98
Activations Density 0.032%
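
As a rough illustration of the density figure above, the sketch below assumes that "activations density" means the fraction of token positions on which the feature fires (activation above zero). That definition, the function name `activation_density`, and the sample data are assumptions made for illustration; they are not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: a position counts as "active" when its activation is
    strictly greater than `threshold` (zero by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 3 of which activate the feature.
example = [0.0] * 9997 + [1.2, 0.4, 2.7]
density = activation_density(example)
print(f"{density:.3%}")  # prints 0.030%
```

Under this reading, a density of 0.032% corresponds to roughly 3 active positions per 10,000 tokens, which is close to the hypothetical sample used here.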