Explanations
words related to thorough discussions and descriptions of concepts or events
Negative Logits
Townsend     -0.62
scrambled    -0.62
examined     -0.59
inspected    -0.59
punished     -0.58
dismissed    -0.57
Silent       -0.57
tons         -0.56
renamed      -0.56
stabbed      -0.56
Positive Logits
BAT          0.82
azeera       0.77
hap          0.75
igmatic      0.72
farious      0.72
aceutical    0.72
ascar        0.72
avascript    0.71
dinand       0.71
arlane       0.71
Activations Density: 0.226%