Explanations
images or visuals described in text
instances of the word "the" in various contexts throughout the text
Negative Logits
Token           Logit
nesty           -0.81
uthor           -0.76
arians          -0.74
LEASE           -0.73
HCR             -0.72
mart            -0.71
olicy           -0.70
thood           -0.70
hun             -0.70
cially          -0.69
Positive Logits
Token           Logit
ceiling          1.04
smallest         1.04
vicinity         1.01
slightest        1.00
aforementioned   0.98
periphery        0.98
edges            0.97
same             0.97
entirety         0.96
nearest          0.96
Activation density: 0.572%