INDEX
Explanations
instances of the word "in," indicating it is looking for locational or contextual phrases
Head Attr Weights
0:0.02
1:0.02
2:0.09
3:0.05
4:0.09
5:0.02
6:0.06
7:0.33
8:0.03
9:0.05
10:0.09
11:0.09
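A minimal sketch for reading the list above, assuming each entry is `head index:weight` for a 12-head attention layer (the page does not say so explicitly). The variable names are illustrative and do not come from any particular tool. It picks out the head with the largest weight and sums the listed values, which are rounded to two decimals and so need not total exactly 1.

```python
# Head attribution weights copied from the list above (head index -> weight).
# Interpreting them as per-head shares of the attribution is an assumption.
head_attr_weights = {
    0: 0.02, 1: 0.02, 2: 0.09, 3: 0.05, 4: 0.09, 5: 0.02,
    6: 0.06, 7: 0.33, 8: 0.03, 9: 0.05, 10: 0.09, 11: 0.09,
}

# Head carrying the largest listed weight.
top_head = max(head_attr_weights, key=head_attr_weights.get)
top_weight = head_attr_weights[top_head]

# Sum of the listed (rounded) values; rounded again to hide float noise.
total = round(sum(head_attr_weights.values()), 2)

print(f"top head: {top_head} (weight {top_weight:.2f}), listed total: {total:.2f}")
```

On these values, head 7 stands out with 0.33, more than three times the next-largest weight of 0.09.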
Negative Logits
Kro: -1.72
pharmacies: -1.49
Leban: -1.48
awatts: -1.47
eez: -1.45
itta: -1.44
glers: -1.42
undown: -1.42
pload: -1.42
yden: -1.42
Positive Logits
Attribution: 1.78
preserving: 1.58
righteousness: 1.55
disclaim: 1.51
disav: 1.42
educating: 1.41
fidelity: 1.40
Computing: 1.40
preservation: 1.39
existential: 1.37
Activations Density 0.001%