Explanations
mentions of the word "care" in various contexts
references to care-related terms
Negative Logits
ッド       -0.86
obin       -0.73
poppy      -0.68
perjury    -0.62
NPR        -0.61
DX         -0.61
integer    -0.61
rand       -0.61
irty       -0.61
RESULTS    -0.60
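Byte-level BPE vocabularies store non-ASCII tokens as strings such as `ãĥĥãĥī`, which is how the katakana token ッド is spelled in such a vocabulary. The sketch below shows one way to recover the readable text. It assumes the feature comes from a model with a GPT-2-style byte-level tokenizer, which the dashboard itself does not state. The names `bytes_to_unicode` and `decode_token` are illustrative and are not part of the dashboard.

```python
def bytes_to_unicode():
    """Rebuild the byte -> printable-character table used by GPT-2-style
    byte-level BPE. Assumption: the model behind this feature uses it."""
    # Bytes that are already printable map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    chars = printable[:]
    shift = 0
    # Every remaining byte is assigned the next code point from U+0100 upward.
    for b in range(256):
        if b not in printable:
            printable.append(b)
            chars.append(256 + shift)
            shift += 1
    return dict(zip(printable, map(chr, chars)))


# Invert the table: displayed character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(display: str) -> str:
    """Turn a token as stored in the vocabulary back into readable text."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in display)
    # A single token may hold a partial UTF-8 sequence, so do not raise.
    return raw.decode("utf-8", errors="replace")


print(decode_token("ãĥĥãĥī"))  # ッド
```

Plain ASCII tokens such as `care` pass through unchanged, and a leading `Ġ` decodes to a space, so the same helper can be applied to every token in both logit lists.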
Positive Logits
taker      1.39
fully      1.05
ful        0.97
tta        0.95
giving     0.95
er         0.93
care       0.92
taking     0.88
lli        0.87
ndra       0.85
Activations Density: 0.017%