Explanations
language related to discussions, descriptions, and explanations of situations, decisions, and events
Negative Logits
Token       Logit
confir      -0.70
iem         -0.66
depth       -0.64
promoter    -0.61
sonian      -0.61
reached     -0.58
gathers     -0.57
ipeg        -0.56
complied    -0.55
iasm        -0.54
Positive Logits
Token       Logit
phas        1.05
enance      0.94
themselves  0.78
igate       0.76
igated      0.76
ourselves   0.75
favorably   0.72
igating     0.71
blame       0.67
virtues     0.66
Activations Density: 2.632%