INDEX
Explanations
phrases describing a comparison or certain types of actions
instances of the word "this" and phrases that denote examples or references
NEGATIVE LOGITS
istries     -0.90
Ni          -0.77
erate       -0.70
half        -0.68
sent        -0.67
verning     -0.65
wa          -0.65
Wr          -0.64
ãĥ´ãĤ¡      -0.63
Fit         -0.63
POSITIVE LOGITS
spoiled     0.73
bookmark    0.65
improvised  0.60
outgoing    0.60
tip         0.60
agine       0.59
sunrise     0.58
modifier    0.57
guiActive   0.57
ragon       0.57
Activations Density: 0.103%