INDEX
Explanations
generic statements following a pattern, potentially related to advice or guiding principles
phrases that emphasize positive qualities or rules
Negative Logits
jri           -0.74
gemony        -0.67
cape          -0.64
Himself       -0.64
eds           -0.63
uthor         -0.61
ruption       -0.61
eters         -0.61
vanquished    -0.60
hyde          -0.60
Positive Logits
enough           1.10
reads            1.08
example          1.04
ol               1.03
bye              1.03
reason           1.00
luck             0.98
Samar            0.95
approximation    0.95
luck             0.93
Activations Density: 0.068%