INDEX
Explanations
phrases indicating positive qualities or actions
instances of the word "good."
Negative Logits
eds       -0.79
idon      -0.76
eters     -0.72
anwhile   -0.72
ifles     -0.71
chan      -0.67
lees      -0.66
hani      -0.66
arthed    -0.66
osures    -0.66
Positive Logits
chunk           1.15
enough          1.13
sword           0.99
approximation   0.99
enough          0.94
deal            0.92
ol              0.91
idea            0.88
amount          0.87
example         0.86
Activations Density: 0.068%