ʻOhana
Population structure, admixture history, and selection using learning methods.
cpax/jade.main.cpp
1 /* -------------------------------------------------------------------------
2  Ohana
3  Copyright (c) 2015-2020 Jade Cheng (\___/)
4  Jade Cheng <info@jade-cheng.com> (='.'=)
5  ------------------------------------------------------------------------- */
6 
7 #include "jade.optimizer.hpp"
8 #include "jade.version.hpp"
9 
10 namespace
11 {
12  char const * const usage = R"(USAGE
13  cpax [options] <g-matrix>
14 
15 ARGUMENTS
16  g-matrix the path to a genotype matrix; the format of
17  the file is determined based on the extension,
18  .dgm (discrete genotype matrix) or
19  .lgm (likelihood genotype matrix)
20 
21 OPTIONS
22  --epsilon,-e indicates the next argument is the epsilon
23  value; i.e. the minimum difference between
24  likelihood calculations per iteration; this
25  value must be greater than or equal to zero
26  --help,-h shows this help message and exits
27  --f-epsilon,-fe indicates the next argument is the epsilon used
28  to clamp values of the allele frequency matrix
29  between (0 + fe) and (1 - fe); if unspecified,
30  this value defaults to 1.0e-6; the value must
31  be greater than 0.0 and less than 0.1
32  --fin,-fi indicates the next argument is the path to the
33  initial F matrix; this option cannot be used
34  with the --fin-force option
35  --fin-force,-fif indicates the next argument is the path to a
36  portion of the initial F matrix; this option
37  cannot be used with the --fin option
38  --fixed-f,-ff indicates the optimizer should not optimize the
39  specified F matrix
40  --fixed-q,-fq indicates the optimizer should not optimize the
41  specified Q matrix
42  --fout,-fo indicates the next argument is the path to the
43  computed F matrix
44  --force,-fg indicates the next argument is the path to a
45  file identifying an assignment of components
46  for each individual and a range of Q values for
47  each component
48  --frequency-bounds,-frb indicates the algorithm applies bounds between
49  1 / (2n + 1) and 1 - (1 / (2n + 1)) for allele
50  frequencies, where n is the number of
51  individuals; without this flag, this bounds
52  are set to 0 and 1
53  --ksize,-k indicates the next argument is the number of
54  components; this value must be at least one
55  --max-iterations,-mi indicates the next argument is the maximum
56  number of iterations to execute the algorithm;
57  this value must be greater than or equal to
58  zero
59  --max-time,-mt indicates the next argument is the maximum time
60  in seconds to execute the algorithm; this value
61  must be greater than or equal to zero
62  --qin,-qi indicates the next argument is the path to the
63  initial Q matrix
64  --qout,-qo indicates the next argument is the path to the
65  computed Q matrix
66  --seed,-s indicates the next argument is the seed for the
67  random number generator
68 
69  At least one of --ksize, --qin, --fin, or --force must be specified in order
70  to determine the number of components (K).
71 
72 DESCRIPTION
73  Under the assumption of Hardy Weinberg Equilibrium, the likelihood of
74  assigning an observed genotype g in individual i at locus j to component k
75  is a function of the allelic frequency f_kj of the locus at k and the
76  fraction of the genome of the individual q_ik that comes from that
77  component. We thus consider the likelihood of the ancestral component
78  proportions vector Q and their vector of allele frequencies F. In
79  particular, if we denote K as the number of ancestry components, I as the
80  number of individuals, and J as the number of polymorphic sites among the I
81  individuals, then the probability of observing the genotype is:
82 
83  sum_i sum_j {
84  g_ij * ln[sum_k (q_ik * f_kj)] +
85  (2 - g_ij) * ln[sum_k (q_ik * (1 - f_kj))]
86  }
87 
88  To estimate Q and F, we follow Newton's method. In general, we can
89  approximate a function F with its second order Taylor expansion F_T. To
90  solve this inequality- and equality-constraint quadratic optimization
91  problem, first we derive the first and second differentials for lnP1(Q, F)
92  with respect to values in Q and F, separately. Then we incorporate Lemke's
93  algorithm [Murty 1988].
94 
95  [Command-Line Output]
96 
97  Unless the --quiet option is specified, the program writes tabular
98  information to standard output. Each iteration, the program writes a row
99  with the following information:
100 
101  1. iteration Number
102  2. seconds expired during the iteration
103  3. log-likelihood after the iteration
104  4. delta log-likelihood from the previous iteration
105 
106  [Notation]
107 
108  K := Number of Components
109  This value must be greater than or equal to one.
110 
111  I := Number of Individuals
112  This value must be greater than or equal to one.
113 
114  J := Number of Markers
115  This value must be greater than or equal to one.
116 
117  G := [I x J] Discrete or Likelihood Genotype Matrix
118  dgm consists of integer values ranging from 0 to 3, inclusive.
119  0 := major-major allele
120  1 := major-minor allele
121  2 := minor-minor allele
122  3 := missing allele information
123  lgm with three floating-point value matrices in the following order.
124  n x m matrix, values 0.0 to 1.0 for minor-minor
125  n x m matrix, values 0.0 to 1.0 for major-minor
126  n x m matrix, values 0.0 to 1.0 for major-major
127 
128  F := [K x J] Frequency Matrix
129  This matrix consists of floating-point values ranging from 0 to 1.
130 
131  Q := [I x K] Proportion Matrix
132  This matrix consists of floating-point values ranging from 0 to 1. Each
133  row sums to 1.0.
134 
135  [Matrix File Format]
136 
137  A matrix file is an ASCII file in the following format. All values are
138  separated by white-space. The first value indicates the number of rows, the
139  second value indicates the number of columns, and then the remaining values
140  represent the values of the matrix in row-major order. For example, the
141  following represents a matrix with 2 rows and 4 columns:
142 
143  2 4
144  1.0 2.0 3.0 4.0
145  5.0 6.0 7.0 8.0
146 
147  In the case of the likelihood genotype matrix, three individual matrices
148  exist within the file representing minor-minor, major-minor, and major-major
149  allele frequencies. For example, the following represents a matrix with
150  2 rows (individuals) and 4 columns (markers):
151 
152  2 4
153  0.1 0.2 0.3 0.4
154  0.5 0.6 0.7 0.8
155  2 4
156  0.9 0.8 0.7 0.6
157  0.5 0.4 0.3 0.2
158  2 4
159  0.1 0.9 0.2 0.8
160  0.3 0.7 0.4 0.6
161 
162  [Forced-Grouping File Format]
163 
164  A forced-grouping file is an ASCII file in the following format. All values
165  are separated by whitespace, and lines beginning with '#' are treated as
166  comments and ignored. The first value indicates the number of individuals
167  (N), the second value indicates the number of components (K), the next N
168  values represent the population index for each individual, and the remaining
169  values represent matrices of ranges for the Q matrix. Each matrix is
170  [2K x 1] in which the first K values indicate the minimum Q values, and the
171  last K values indicate the maximum Q values. For example, the following
172  represents a forged-grouping of 15 individuals in 3 components:
173 
174  # 'I' Individuals and 'K' Components
175  15 3
176 
177  # Component Assignments per Individual
178  0 1 2 0 0 0 1 1 1 0 0 1 2 2 2
179 
180  # Population 0
181  6 1
182  0.4 0.0 0.0
183  1.0 1.0 1.0
184 
185  # Population 1
186  6 1
187  0 0 0.4 0.0
188  1.0 1.0 0.1
189 
190  # Population 2
191  6 1
192  0.0 0.0 0.4
193  1.0 0.1 1.0
194 
195 EXAMPLES
196  $ cpax -k 4 -qo ./qout.matrix -fo ./fout.matrix -mi 5 ./g.dgm
197  seed: 3964111000
198 
199  0 0.027389 -3.119424845029e+06
200  1 0.864429 -2.334222638280e+06 7.852022067495e+05
201  2 0.837323 -2.295676472449e+06 3.854616583094e+04
202  3 0.745620 -2.260783781893e+06 3.489269055576e+04
203  4 0.703584 -2.231310220386e+06 2.947356150691e+04
204  5 0.693284 -2.205170273510e+06 2.613994687624e+04
205 
206  Writing Q matrix to ./qout.matrix
207  Writing F matrix to ./fout.matrix
208 
209 BUGS
210  Report any bugs to Jade Cheng <info@jade-cheng.com>.
211 
212 Copyright (c) 2015-2020 Jade Cheng
213 )";
214 }
215 
216 ///
217 /// The main entry point of the program.
218 /// \param argc The argument count.
219 /// \param argv The argument values.
220 /// \return EXIT_SUCCESS or EXIT_FAILURE.
221 ///
222 int main(const int argc, const char * argv[])
223 {
224  try
225  {
226  jade::args args (argc, argv);
227 
228  if (args.read_flag("--help", "-h"))
229  {
230  std::cout << ::usage;
231  return EXIT_SUCCESS;
232  }
233 
234  if (args.read_flag("--version", "-v"))
235  {
236  jade::version::write("cpax", std::cout);
237  return EXIT_SUCCESS;
238  }
239 
240  typedef double value_type;
241  typedef jade::basic_settings<value_type> settings_type;
242  typedef jade::basic_optimizer<value_type> optimizer_type;
243 
244  settings_type settings (args);
245  optimizer_type::execute(
246  settings,
247  settings.get_q(),
248  settings.get_f());
249 
250  return EXIT_SUCCESS;
251  }
252  catch (const std::exception & e)
253  {
254  std::cerr << e.what() << std::endl;
255  return EXIT_FAILURE;
256  }
257 }
jade::basic_settings
A template for a class encapsulating the settings provided to the optimizer.
Definition: cpax/jade.settings.hpp:22
jade::basic_args
A template for a class that helps process command-line arguments.
Definition: jade.args.hpp:19
jade::basic_version::write
static void write(char_type const *const title, ostream_type &out)
Writes the string displayed to the user.
Definition: jade.version.hpp:29
jade::basic_optimizer
A template for a class that optimizes the Q and F matrices.
Definition: cpax/jade.optimizer.hpp:21