ʻOhana
Population structure, admixture history, and selection using learning methods.
ohana
src
cpax
cpax/jade.main.cpp
1
/* -------------------------------------------------------------------------
2
Ohana
3
Copyright (c) 2015-2020 Jade Cheng (\___/)
4
Jade Cheng <info@jade-cheng.com> (='.'=)
5
------------------------------------------------------------------------- */
6
7
#include "jade.optimizer.hpp"
8
#include "jade.version.hpp"
9
10
namespace
11
{
12
char
const
*
const
usage = R
"(USAGE
13
cpax [options] <g-matrix>
14
15
ARGUMENTS
16
g-matrix the path to a genotype matrix; the format of
17
the file is determined based on the extension,
18
.dgm (discrete genotype matrix) or
19
.lgm (likelihood genotype matrix)
20
21
OPTIONS
22
--epsilon,-e indicates the next argument is the epsilon
23
value; i.e. the minimum difference between
24
likelihood calculations per iteration; this
25
value must be greater than or equal to zero
26
--help,-h shows this help message and exits
27
--f-epsilon,-fe indicates the next argument is the epsilon used
28
to clamp values of the allele frequency matrix
29
between (0 + fe) and (1 - fe); if unspecified,
30
this value defaults to 1.0e-6; the value must
31
be greater than 0.0 and less than 0.1
32
--fin,-fi indicates the next argument is the path to the
33
initial F matrix; this option cannot be used
34
with the --fin-force option
35
--fin-force,-fif indicates the next argument is the path to a
36
portion of the initial F matrix; this option
37
cannot be used with the --fin option
38
--fixed-f,-ff indicates the optimizer should not optimize the
39
specified F matrix
40
--fixed-q,-fq indicates the optimizer should not optimize the
41
specified Q matrix
42
--fout,-fo indicates the next argument is the path to the
43
computed F matrix
44
--force,-fg indicates the next argument is the path to a
45
file identifying an assignment of components
46
for each individual and a range of Q values for
47
each component
48
--frequency-bounds,-frb indicates the algorithm applies bounds between
49
1 / (2n + 1) and 1 - (1 / (2n + 1)) for allele
50
frequencies, where n is the number of
51
individuals; without this flag, this bounds
52
are set to 0 and 1
53
--ksize,-k indicates the next argument is the number of
54
components; this value must be at least one
55
--max-iterations,-mi indicates the next argument is the maximum
56
number of iterations to execute the algorithm;
57
this value must be greater than or equal to
58
zero
59
--max-time,-mt indicates the next argument is the maximum time
60
in seconds to execute the algorithm; this value
61
must be greater than or equal to zero
62
--qin,-qi indicates the next argument is the path to the
63
initial Q matrix
64
--qout,-qo indicates the next argument is the path to the
65
computed Q matrix
66
--seed,-s indicates the next argument is the seed for the
67
random number generator
68
69
At least one of --ksize, --qin, --fin, or --force must be specified in order
70
to determine the number of components (K).
71
72
DESCRIPTION
73
Under the assumption of Hardy Weinberg Equilibrium, the likelihood of
74
assigning an observed genotype g in individual i at locus j to component k
75
is a function of the allelic frequency f_kj of the locus at k and the
76
fraction of the genome of the individual q_ik that comes from that
77
component. We thus consider the likelihood of the ancestral component
78
proportions vector Q and their vector of allele frequencies F. In
79
particular, if we denote K as the number of ancestry components, I as the
80
number of individuals, and J as the number of polymorphic sites among the I
81
individuals, then the probability of observing the genotype is:
82
83
sum_i sum_j {
84
g_ij * ln[sum_k (q_ik * f_kj)] +
85
(2 - g_ij) * ln[sum_k (q_ik * (1 - f_kj))]
86
}
87
88
To estimate Q and F, we follow Newton's method. In general, we can
89
approximate a function F with its second order Taylor expansion F_T. To
90
solve this inequality- and equality-constraint quadratic optimization
91
problem, first we derive the first and second differentials for lnP1(Q, F)
92
with respect to values in Q and F, separately. Then we incorporate Lemke's
93
algorithm [Murty 1988].
94
95
[Command-Line Output]
96
97
Unless the --quiet option is specified, the program writes tabular
98
information to standard output. Each iteration, the program writes a row
99
with the following information:
100
101
1. iteration Number
102
2. seconds expired during the iteration
103
3. log-likelihood after the iteration
104
4. delta log-likelihood from the previous iteration
105
106
[Notation]
107
108
K := Number of Components
109
This value must be greater than or equal to one.
110
111
I := Number of Individuals
112
This value must be greater than or equal to one.
113
114
J := Number of Markers
115
This value must be greater than or equal to one.
116
117
G := [I x J] Discrete or Likelihood Genotype Matrix
118
dgm consists of integer values ranging from 0 to 3, inclusive.
119
0 := major-major allele
120
1 := major-minor allele
121
2 := minor-minor allele
122
3 := missing allele information
123
lgm with three floating-point value matrices in the following order.
124
n x m matrix, values 0.0 to 1.0 for minor-minor
125
n x m matrix, values 0.0 to 1.0 for major-minor
126
n x m matrix, values 0.0 to 1.0 for major-major
127
128
F := [K x J] Frequency Matrix
129
This matrix consists of floating-point values ranging from 0 to 1.
130
131
Q := [I x K] Proportion Matrix
132
This matrix consists of floating-point values ranging from 0 to 1. Each
133
row sums to 1.0.
134
135
[Matrix File Format]
136
137
A matrix file is an ASCII file in the following format. All values are
138
separated by white-space. The first value indicates the number of rows, the
139
second value indicates the number of columns, and then the remaining values
140
represent the values of the matrix in row-major order. For example, the
141
following represents a matrix with 2 rows and 4 columns:
142
143
2 4
144
1.0 2.0 3.0 4.0
145
5.0 6.0 7.0 8.0
146
147
In the case of the likelihood genotype matrix, three individual matrices
148
exist within the file representing minor-minor, major-minor, and major-major
149
allele frequencies. For example, the following represents a matrix with
150
2 rows (individuals) and 4 columns (markers):
151
152
2 4
153
0.1 0.2 0.3 0.4
154
0.5 0.6 0.7 0.8
155
2 4
156
0.9 0.8 0.7 0.6
157
0.5 0.4 0.3 0.2
158
2 4
159
0.1 0.9 0.2 0.8
160
0.3 0.7 0.4 0.6
161
162
[Forced-Grouping File Format]
163
164
A forced-grouping file is an ASCII file in the following format. All values
165
are separated by whitespace, and lines beginning with '#' are treated as
166
comments and ignored. The first value indicates the number of individuals
167
(N), the second value indicates the number of components (K), the next N
168
values represent the population index for each individual, and the remaining
169
values represent matrices of ranges for the Q matrix. Each matrix is
170
[2K x 1] in which the first K values indicate the minimum Q values, and the
171
last K values indicate the maximum Q values. For example, the following
172
represents a forged-grouping of 15 individuals in 3 components:
173
174
# 'I' Individuals and 'K' Components
175
15 3
176
177
# Component Assignments per Individual
178
0 1 2 0 0 0 1 1 1 0 0 1 2 2 2
179
180
# Population 0
181
6 1
182
0.4 0.0 0.0
183
1.0 1.0 1.0
184
185
# Population 1
186
6 1
187
0 0 0.4 0.0
188
1.0 1.0 0.1
189
190
# Population 2
191
6 1
192
0.0 0.0 0.4
193
1.0 0.1 1.0
194
195
EXAMPLES
196
$ cpax -k 4 -qo ./qout.matrix -fo ./fout.matrix -mi 5 ./g.dgm
197
seed: 3964111000
198
199
0 0.027389 -3.119424845029e+06
200
1 0.864429 -2.334222638280e+06 7.852022067495e+05
201
2 0.837323 -2.295676472449e+06 3.854616583094e+04
202
3 0.745620 -2.260783781893e+06 3.489269055576e+04
203
4 0.703584 -2.231310220386e+06 2.947356150691e+04
204
5 0.693284 -2.205170273510e+06 2.613994687624e+04
205
206
Writing Q matrix to ./qout.matrix
207
Writing F matrix to ./fout.matrix
208
209
BUGS
210
Report any bugs to Jade Cheng <info@jade-cheng.com>.
211
212
Copyright (c) 2015-2020 Jade Cheng
213
)";
214
}
215
216
///
217
/// The main entry point of the program.
218
/// \param argc The argument count.
219
/// \param argv The argument values.
220
/// \return EXIT_SUCCESS or EXIT_FAILURE.
221
///
222
int
main(
const
int
argc,
const
char
* argv[])
223
{
224
try
225
{
226
jade::args
args (argc, argv);
227
228
if
(args.read_flag(
"--help"
,
"-h"
))
229
{
230
std::cout << ::usage;
231
return
EXIT_SUCCESS;
232
}
233
234
if
(args.read_flag(
"--version"
,
"-v"
))
235
{
236
jade::version::write
(
"cpax"
, std::cout);
237
return
EXIT_SUCCESS;
238
}
239
240
typedef
double
value_type;
241
typedef
jade::basic_settings<value_type>
settings_type;
242
typedef
jade::basic_optimizer<value_type>
optimizer_type;
243
244
settings_type settings (args);
245
optimizer_type::execute(
246
settings,
247
settings.get_q(),
248
settings.get_f());
249
250
return
EXIT_SUCCESS;
251
}
252
catch
(
const
std::exception & e)
253
{
254
std::cerr << e.what() << std::endl;
255
return
EXIT_FAILURE;
256
}
257
}
jade::basic_settings
A template for a class encapsulating the settings provided to the optimizer.
Definition:
cpax/jade.settings.hpp:22
jade::basic_args
A template for a class that helps process command-line arguments.
Definition:
jade.args.hpp:19
jade::basic_version::write
static void write(char_type const *const title, ostream_type &out)
Writes the string displayed to the user.
Definition:
jade.version.hpp:29
jade::basic_optimizer
A template for a class that optimizes the Q and F matrices.
Definition:
cpax/jade.optimizer.hpp:21