INDEX
Explanations
words related to causation and outcome
phrases that indicate consequences or effects
Negative Logits
Token       Logit
ĸļ          -0.73
Jur         -0.66
Bastard     -0.62
Kun         -0.60
Wend        -0.60
Pants       -0.58
Kush        -0.58
raint       -0.57
Sut         -0.57
Sabb        -0.57
Positive Logits
Token       Logit
depending   0.80
depending   0.78
tricky      0.72
gettable    0.72
safely      0.69
odder       0.68
ESE         0.68
anywhere    0.67
easily      0.64
pole        0.64
Activations Density: 0.300%