23andMe to plink
Raw genome data downloaded from 23andMe must be converted to a different file format before plink can process it. The easiest conversion is to plink’s map and ped formats. On this page, I’ll briefly describe these file formats and provide a Python script to perform the conversion.
23andMe’s Raw Genome File Format
A raw genome file downloaded from 23andMe is a tab-separated ASCII text file consisting of four columns for an RSID (Reference SNP Cluster ID), a chromosome, a position on that chromosome, and a pair of nucleotides. Each record in the file represents a SNP. For example,
    :
    # Below is a text version of your data. Fields are TAB-separated. Each line
    # corresponds to a single SNP. For each SNP, we provide its identifier (an
    # rsid or an internal id), its location on the reference human genome, and the
    # genotype call oriented with respect to the plus strand on the human reference
    # sequence. We are using reference human assembly build 37 (also known as
    # Annotation Release 104). Note that it is possible that data downloaded at
    # different times may be different due to ongoing improvements in our ability
    # to call genotypes.
    :
    # rsid        chromosome  position  genotype
    rs12564807    1           734462    AA
    rs3131972     1           752721    AG
    rs148828841   1           760998    CC
    rs12124819    1           776546    AA
    rs115093905   1           787173    GG
    :
These SNP records continue for several hundred thousand lines: 602,347 records, to be exact, in the 723 files I recently had the pleasure of analyzing.
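As a rough illustration of the layout described above, the sketch below reads records from an iterable of lines, skipping the `#` comment lines and splitting each data line on tabs. The names `SnpRecord` and `parse_23andme` are hypothetical and chosen only for this example; they are not part of any 23andMe or plink tooling.

```python
from typing import Iterable, Iterator, NamedTuple


class SnpRecord(NamedTuple):
    """One data line of a raw genome file (hypothetical helper type)."""
    rsid: str
    chromosome: str
    position: int
    genotype: str


def parse_23andme(lines: Iterable[str]) -> Iterator[SnpRecord]:
    """Yield one SnpRecord per data line, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        # Assumption: every data line has exactly four tab-separated fields.
        rsid, chromosome, position, genotype = line.split("\t")
        yield SnpRecord(rsid, chromosome, int(position), genotype)


sample = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs12564807\t1\t734462\tAA",
    "rs3131972\t1\t752721\tAG",
]
records = list(parse_23andme(sample))
```

In practice the same function could be fed an open file object, since iterating over a file yields its lines.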
The rsid field provides a unique identifier for the record. No two lines in the raw genome file ever contain the same RSID. For RSIDs that begin with ‘rs’, information regarding the SNP can be found in the dbSNP database from the National Center for Biotechnology Information.
The chromosome field ranges from 1 to 22, inclusive, and it may also contain the entries X, Y, XY, and MT. For females, SNPs on the Y chromosome are not available.
The position field is an integer that ranges from 3 to 249218992 in the data I have analyzed. The combination of the chromosome field and the position field may not be unique within the file. Furthermore, the genotypes for SNPs with the same chromosome and position may differ as well. For example, we may see something like the following scattered throughout the file:
    # rsid       chromosome  position  genotype
    rs35669628   11          5246865   GG
    rs63751034   11          5246865   II
    :
    i4000438     15          72640388  CC
    i4000439     15          72640388  CC
    i5004858     15          72640388  GG
    rs76173977   15          72640388  CC
    i6051228     15          72640388  CC
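To see how often this happens in a given file, one option is to group records by chromosome and position and keep only the groups with more than one RSID. This is a minimal sketch; `shared_positions` is a hypothetical name, and records are assumed to be plain `(rsid, chromosome, position, genotype)` tuples.

```python
from collections import defaultdict


def shared_positions(records):
    """Map (chromosome, position) to its RSIDs, keeping only loci seen more than once."""
    by_locus = defaultdict(list)
    for rsid, chromosome, position, _genotype in records:
        by_locus[(chromosome, position)].append(rsid)
    return {locus: rsids for locus, rsids in by_locus.items() if len(rsids) > 1}


records = [
    ("rs35669628", "11", 5246865, "GG"),
    ("rs63751034", "11", 5246865, "II"),
    ("rs12564807", "1", 734462, "AA"),
]
dupes = shared_positions(records)
```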
Lastly, the genotype field may contain one or two symbols representing nucleotides. If only one symbol is present, it is assumed the second nucleotide is the same as the first; e.g., A represents the same information as AA. This field may also contain dashes (--), which indicate the SNP information is not available. Otherwise, each symbol is one of the following: A, C, T, G, D, or I.
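The genotype rules above can be sketched as a small normalizing function. The name `split_genotype` is hypothetical, and returning `None` for a no-call is a choice made for this example only. Any value other than the forms described above raises an error rather than being guessed at.

```python
VALID_SYMBOLS = set("ACGTDI")


def split_genotype(genotype):
    """Return a pair of allele symbols, or None when the call is missing ('--')."""
    if genotype == "--":
        return None
    if len(genotype) == 1:
        # A single symbol stands for two identical alleles, e.g. 'A' means 'AA'.
        genotype = genotype * 2
    if len(genotype) != 2 or not set(genotype) <= VALID_SYMBOLS:
        raise ValueError(f"unexpected genotype: {genotype!r}")
    return genotype[0], genotype[1]
```

For instance, `split_genotype("A")` and `split_genotype("AA")` both give `("A", "A")`.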
plink’s Flat File Formats
plink reads a collection of SNP information from two files. The first file contains the pedigree and nucleotide information for an individual; this is called the ped file. The second file specifies the order of the SNPs within the pedigree file, and it provides metadata about those SNPs; this is the map file.
plink’s ped File Format
A ped file is a whitespace-separated ASCII text file consisting of a header and then some number of nucleotide symbols. For example,
    family_id individual_id dad_id mom_id 2 -9 A A A G C C A A G G G G A A A C C C A A G G C T C C T T G G G G G G G G G G C C C C G G G G G G G G A A C C A A C C C C C C A A C C A G 0 0 G G
    :
Perhaps the most important field of the header is the individual’s identification code. This identifier is used to locate the individual after performing some analysis with plink. The exact text stored for the individual’s identifier is domain specific, and in the examples that follow, I will simply use foo.
Other identifiers in the header specify the family, father, and mother. When converting data from 23andMe’s raw genome files, however, it’s unlikely this information will be available. In the examples that follow, I will simply use foo_FAM, foo_FATHER, and foo_MOTHER.
After the identifiers come two more fields for the gender and the phenotype. The gender is encoded as follows: 1 for male and 2 for female; any other value indicates the gender is unknown. The last field, the phenotype, is generally not available from the 23andMe data, and it is common to store -9 in this case, which indicates the phenotype is missing.
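The six header fields can be assembled along the following lines. This is only a sketch: `ped_header` is a hypothetical name, the `_FAM`, `_FATHER`, and `_MOTHER` suffixes follow the placeholder convention used on this page, and writing `0` for an unknown gender is an assumption (the text above only says that any value other than 1 or 2 means unknown).

```python
SEX_CODES = {"male": "1", "female": "2"}


def ped_header(individual_id, gender=None, phenotype="-9"):
    """Return the six leading ped fields for one individual as a list of strings."""
    return [
        individual_id + "_FAM",      # family identifier (placeholder)
        individual_id,               # individual identifier
        individual_id + "_FATHER",   # father identifier (placeholder)
        individual_id + "_MOTHER",   # mother identifier (placeholder)
        SEX_CODES.get(gender, "0"),  # assumption: 0 stands for unknown gender
        phenotype,                   # -9 marks a missing phenotype
    ]


header = ped_header("foo", gender="female")
```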
After the header come (many) nucleotides, again separated by whitespace. The symbols stored here match the symbols provided by 23andMe’s raw genome files with one exception: instead of encoding missing genotype information with dashes (--), plink expects them to be encoded with 0s (0 0).
An important requirement to note is that the order of the nucleotides must match the order of the SNPs from a corresponding map file. The number of nucleotides, therefore, will equal twice the number of SNPs present in that file.
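Putting the allele rules together, a ped record might be built as sketched below. `ped_alleles` is a hypothetical helper; it assumes the genotype strings arrive in the same order as the SNPs in the map file, and the header values are the placeholders used on this page with an unknown gender written as `0` (an assumption).

```python
def ped_alleles(genotypes):
    """Flatten 23andMe genotype strings into ped allele tokens, two per SNP."""
    tokens = []
    for genotype in genotypes:
        if genotype == "--":
            tokens += ["0", "0"]            # missing call becomes 0 0
        elif len(genotype) == 1:
            tokens += [genotype, genotype]  # single symbol is doubled
        else:
            tokens += [genotype[0], genotype[1]]
    return tokens


header = ["foo_FAM", "foo", "foo_FATHER", "foo_MOTHER", "0", "-9"]
genotypes = ["AA", "AG", "--", "C"]
alleles = ped_alleles(genotypes)
ped_line = " ".join(header + alleles)
```

Every genotype contributes exactly two tokens, so `len(alleles)` is always twice `len(genotypes)`, which is the invariant stated above.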
plink’s map File Format
A map file is a tab-separated ASCII text file consisting of four columns for a chromosome, a SNP identifier, an optional position in morgans, and a base-pair coordinate. Each record in the file provides information about a SNP, and the order of the records must match the order of the nucleotide pairs present in one or more corresponding ped files. For example,
    1   rs12564807    0   734462
    1   rs3131972     0   752721
    1   rs148828841   0   760998
    1   rs12124819    0   776546
    1   rs115093905   0   787173
    :
The first column of the map file indicates the chromosome, and the values here are the same as the ones from 23andMe’s raw genome file, with a few exceptions. In samples I have seen, plink chromosomes range from 1 to 26, inclusive, where 23, 24, 25, and 26 correspond to chromosomes X, Y, XY, and MT, respectively.
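The chromosome translation can be expressed as a small lookup. This is a sketch using the numeric codes listed above; `plink_chromosome` is a hypothetical name, and values outside the ranges described on this page are rejected rather than passed through.

```python
CHROMOSOME_CODES = {"X": "23", "Y": "24", "XY": "25", "MT": "26"}


def plink_chromosome(chromosome):
    """Translate a 23andMe chromosome label into the numeric code used in the map file."""
    if chromosome in CHROMOSOME_CODES:
        return CHROMOSOME_CODES[chromosome]
    if chromosome.isdigit() and 1 <= int(chromosome) <= 22:
        return chromosome  # autosomes keep their number
    raise ValueError(f"unexpected chromosome: {chromosome!r}")
```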
The second column indicates a variant identifier, which directly corresponds to 23andMe’s rsid field. Every record in the map file contains a unique identifier.
The third column indicates the variant’s position in morgans, but this column is optional. To ignore this column, we provide 0, and this seems to be a valid and appropriate entry for many scenarios.
Finally, the fourth column indicates the base-pair coordinate, which directly corresponds to 23andMe’s position field.
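The four columns can be assembled into one map record as sketched below. `map_line` is a hypothetical helper; it assumes the chromosome has already been translated to plink’s numeric code and writes 0 in the third column by default.

```python
def map_line(rsid, chromosome, position, morgans=0):
    """Format one map record: chromosome, identifier, position in morgans, base-pair coordinate."""
    return "\t".join([chromosome, rsid, str(morgans), str(position)])


line = map_line("rs12564807", "1", 734462)
```

Note that the column order differs from the raw genome file, where the RSID comes first and the chromosome second.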
The map file does not include information specific to an individual. This means that for a set of raw genome files downloaded from 23andMe, a single map file can be used for all individuals in the study.
If all SNPs are converted from a raw genome file from 23andMe, then the number of records in the map file will equal the number of records in the genome file.
Conversion Using Python
The following Python script performs the conversion described above for a file specified on the command line. The input genome file is transformed into two output files with the same base name but different extensions, map and ped. If gender information is available, it can be specified on the command line using the --gender option, for which valid arguments are male and female.