Explanations
instances where the text mentions contrasting or different options
references to alternative options or consequences
Negative Logits
Encyclopedia    -0.70
Mehran          -0.64
Abstract        -0.63
ãĥī             -0.61
UES             -0.60
Lenin           -0.60
oret            -0.58
Reef            -0.58
Upload          -0.58
forestation     -0.57
Positive Logits
worldly         1.19
besides         0.94
entirely        0.78
arettes         0.73
where           0.70
isin            0.70
Joined          0.69
mia             0.68
adin            0.68
¯               0.64
Activations Density: 0.037%