Explanations
mentions of legal documents or regulations
Negative Logits
nodd -0.70
boun -0.69
strangers -0.69
unlucky -0.68
stranger -0.67
stray -0.67
predec -0.67
stubborn -0.67
manif -0.66
passers -0.66
Positive Logits
Includes 1.20
These 1.18
Additionally 1.17
Retrieved 1.16
Also 1.07
Specifically 1.07
NOTE 1.06
Such 1.06
Originally 1.05
Among 1.02
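Logit lists like the two above are commonly produced by projecting a feature's decoder direction through the model's unembedding matrix and reading off the most negative and most positive entries. The sketch below assumes exactly that; the names `decoder_dir`, `W_U`, and `top_logit_tokens` are hypothetical, the toy numbers are illustrative only, and the dashboard's actual pipeline may differ in details such as normalization.

```python
import numpy as np

def top_logit_tokens(decoder_dir, W_U, vocab, k=10):
    """Return the k most negative and k most positive tokens for one feature.

    decoder_dir: shape (d_model,), the feature's decoder vector (assumed input)
    W_U: shape (d_model, vocab_size), the unembedding matrix (assumed input)
    vocab: list of token strings, indexed like the columns of W_U
    """
    logits = decoder_dir @ W_U            # direct effect on each token's logit
    order = np.argsort(logits)            # token indices, ascending by logit
    negative = [(vocab[i], float(logits[i])) for i in order[:k]]
    positive = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return negative, positive

# Toy example with a 2-dimensional residual stream and 4 hypothetical tokens.
decoder_dir = np.array([1.0, 0.0])
W_U = np.array([[0.5, -0.2, 1.2, -0.7],
                [9.0, 9.0, 9.0, 9.0]])
neg, pos = top_logit_tokens(decoder_dir, W_U, ["tok_a", "tok_b", "tok_c", "tok_d"], k=2)
print("negative:", neg)
print("positive:", pos)
```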
Activations Density: 0.364%
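Activation density is usually read as the share of sampled tokens on which the feature's activation is above zero, so 0.364% means roughly 3.6 firing tokens per 1,000. A minimal sketch under that assumption (the function name and the `threshold` parameter are hypothetical):

```python
import numpy as np

def activation_density(activations, threshold=0.0):
    """Percentage of tokens whose feature activation exceeds `threshold`."""
    acts = np.asarray(activations, dtype=float)
    return 100.0 * np.count_nonzero(acts > threshold) / acts.size

# Toy activations over four tokens: the feature fires on one of them.
print(activation_density([0.0, 0.0, 1.5, 0.0]))
```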