Explanations
References to specific brands or names within the text
Negative Logits
gether       -0.23
bidden       -0.22
etheless     -0.20
adays        -0.19
tempts       -0.19
achusetts    -0.18
vasion       -0.17
quarters     -0.17
theless      -0.16
gomery       -0.16
Positive Logits
orem          0.17
chyb          0.16
infeld        0.15
imiter        0.15
(not shown)   0.14
="__          0.14
iland         0.14
bie           0.14
dish          0.14
alic          0.13
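The two lists above rank tokens by how strongly this feature pushes their output logits down (negative) or up (positive). One common way to produce such lists, assumed here rather than confirmed by this page, is to take the dot product of the feature's decoder direction with each token's unembedding vector and keep the k most negative and k most positive results. The sketch below is pure Python; `logit_effects`, `top_and_bottom`, and the toy inputs are hypothetical.

```python
# Hypothetical sketch: rank tokens by a feature's direct effect on output logits.
# Assumption: effect(token) = dot(decoder_dir, unembed[:, token]); this is not
# confirmed as the method used to build the lists on this page.

def logit_effects(decoder_dir, unembed, vocab):
    """Return (token, effect) pairs; unembed is a d_model x vocab_size nested list."""
    effects = []
    for t, token in enumerate(vocab):
        column = [row[t] for row in unembed]  # unembedding vector for this token
        effect = sum(d * w for d, w in zip(decoder_dir, column))
        effects.append((token, effect))
    return effects


def top_and_bottom(effects, k):
    """Return the k most negative and the k most positive (token, effect) pairs."""
    ranked = sorted(effects, key=lambda pair: pair[1])
    negative = ranked[:k]                   # most negative first
    positive = list(reversed(ranked[-k:]))  # most positive first
    return negative, positive


# Toy example with a 2-dimensional model and a 4-token vocabulary.
vocab = ["a", "b", "c", "d"]
decoder_dir = [1.0, -0.5]
unembed = [
    [0.2, -0.1, 0.0, 0.3],
    [0.4, 0.2, -0.2, 0.0],
]
negative, positive = top_and_bottom(logit_effects(decoder_dir, unembed, vocab), k=1)
print(negative, positive)
```

Because only the decoder direction and the unembedding are involved, this measures the feature's direct path to the logits and ignores any effect routed through later layers.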
Activation density: 0.193%
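Activation density is usually read as the fraction of token positions on which the feature fires, so 0.193% corresponds to roughly 1.93 active positions per 1,000 tokens. The sketch below assumes "fires" means an activation strictly above a threshold of zero; the threshold actually used for this page is not stated, and `activation_density` is a hypothetical helper.

```python
# Hypothetical sketch: activation density as the share of positions with activation > threshold.
# Assumption: threshold = 0.0 by default; the page does not state its threshold.

def activation_density(activations, threshold=0.0):
    """Fraction of token positions where the feature's activation exceeds threshold."""
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# 1,000 positions, 2 of which are active -> density 0.002, i.e. 0.2%.
acts = [0.0] * 997 + [0.5, 1.2, 0.0]
print(f"{activation_density(acts):.3%}")
```

On this reading, a density of 0.193% marks the feature as sparse: it is silent on the vast majority of tokens.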