Merge, remove, convert datasets in BED, PED-FAM, or GENO-SNP formats using PLINK and Eigensoft

Last modified: 1st November 2017

Prepare a merged dataset

I wanted to obtain many samples, especially European ones, to be able to compare any individual ancient sample in detail.

This is what I did; try it if you want, and add or remove public datasets as necessary.

To try the most common ancient and modern samples from the Reich lab, I downloaded the following packages.

https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/files/inline-files/NearEastPublic.tar.gz

https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/files/inline-files/ScythianSarmatian.tar.gz

https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/files/inline-files/MinMyc.tar.gz

You probably want to follow these steps:

mkdir BED
cd BED

wget https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/files/inline-files/NearEastPublic.tar.gz  https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/files/inline-files/ScythianSarmatian.tar.gz  https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/files/inline-files/MinMyc.tar.gz

tar -zxvf NearEastPublic.tar.gz
tar -zxvf ScythianSarmatian.tar.gz
tar -zxvf MinMyc.tar.gz

Each of these packages contains three files: xxx.geno / xxx.ind / xxx.snp , where xxx is the name of the dataset. From the NearEastPublic package I only selected the HumanOriginsPublic2068 files.

Put all extracted files in the same folder (I called it BED).

I wanted to merge the files and remove some samples that are not useful for further analyses.

Now you have to decide if you just want to merge the data, in which case I recommend working directly with binaries, which is simpler, or if you want to add some BED or PED files, in which case you need to work with those files.
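Before going further, it can help to confirm that each dataset has its three files in the working folder. The following is only a minimal sketch: the function name check_dataset is made up for illustration, and the dataset names are the ones used in this post.

```shell
# Hypothetical helper: report whether a dataset has its .geno/.ind/.snp files.
# Returns 0 if all three files exist, 1 otherwise.
check_dataset() {
    prefix=$1
    missing=0
    for ext in geno ind snp; do
        if [ ! -f "$prefix.$ext" ]; then
            echo "missing: $prefix.$ext"
            missing=1
        fi
    done
    return "$missing"
}

# Dataset names as used in this post.
for name in MinMyc ScythianSarmatian HumanOriginsPublic2068; do
    if check_dataset "$name"; then
        echo "ok: $name"
    fi
done
```

Run it from inside the BED folder; any line starting with "missing:" points to a file that was not extracted or was given a different name.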

Work directly with binaries

You can merge datasets directly with Eigensoft.

For example, to merge the Minoans and Mycenaeans dataset with that of Scythians and Sarmatians, you will need a parameter file like this (I named it mergeMinSS):

geno1: MinMyc.geno
snp1: MinMyc.snp
ind1: MinMyc.ind
geno2: ScythianSarmatian.geno
snp2: ScythianSarmatian.snp
ind2: ScythianSarmatian.ind
indoutfilename: MinSS.ind
snpoutfilename: MinSS.snp
genooutfilename: MinSS.geno
outputformat: EIGENSTRAT

mergeit -p mergeMinSS

If you are interested, for example, in comparing it with modern populations, you can use a file like this (I named it mergeMinSSHO):

geno1: MinSS.geno
snp1:  MinSS.snp
ind1:  MinSS.ind
geno2: HumanOriginsPublic2068.geno
snp2:  HumanOriginsPublic2068.snp
ind2:  HumanOriginsPublic2068.ind
indoutfilename:  MinSSHO.ind
snpoutfilename:  MinSSHO.snp
genooutfilename:  MinSSHO.geno
outputformat: EIGENSTRAT

Now merge them
mergeit -p mergeMinSSHO
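The two parameter files above differ only in their file prefixes, so they can be generated rather than typed by hand. This is a sketch under assumptions: write_mergeit_par is a hypothetical helper name, and mergeit is only called if it is found on the PATH.

```shell
# Hypothetical helper: write a mergeit parameter file for two datasets.
# Usage: write_mergeit_par FIRST SECOND OUT PARFILE
write_mergeit_par() {
    cat > "$4" <<EOF
geno1: $1.geno
snp1: $1.snp
ind1: $1.ind
geno2: $2.geno
snp2: $2.snp
ind2: $2.ind
indoutfilename: $3.ind
snpoutfilename: $3.snp
genooutfilename: $3.geno
outputformat: EIGENSTRAT
EOF
}

# Recreate the two parameter files shown above.
write_mergeit_par MinMyc ScythianSarmatian MinSS mergeMinSS
write_mergeit_par MinSS HumanOriginsPublic2068 MinSSHO mergeMinSSHO

# Run the merges in order, but only if mergeit is installed.
if command -v mergeit >/dev/null 2>&1; then
    mergeit -p mergeMinSS && mergeit -p mergeMinSSHO
fi
```

The second merge takes the output of the first (MinSS) as its input, so the order of the two mergeit calls matters.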

NOTES: When merging some files, you might need to add the following line at the end of the parameter file
allowdups: YES
because there are duplicates, although this is likely to cause problems in later merges (or later when analyzing data). You can eliminate the duplicates after merging, using PLINK. Also, please note the names and order of the output parameters in the files above. If you use the current standard names instead (the ones convertf expects), mergeit will give errors:
genotypeoutname: Min51.geno
snpoutname: Min51.snp
indivoutname: Min51.ind
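To see in advance whether two datasets share sample IDs, the first column of their .ind files can be compared. This is a sketch that assumes the duplicates mentioned in the note are repeated sample IDs; shared_ids is a hypothetical helper name.

```shell
# Hypothetical helper: print first-column IDs that occur in both files.
# For .ind files the first column is the sample ID.
shared_ids() {
    awk 'NR == FNR { seen[$1] = 1; next } ($1 in seen) { print $1 }' "$1" "$2"
}

# List sample IDs present in both datasets, if the files are available.
if [ -f MinMyc.ind ] && [ -f ScythianSarmatian.ind ]; then
    shared_ids MinMyc.ind ScythianSarmatian.ind
fi
```

If the command prints nothing, the two .ind files have no sample ID in common.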

Work with PLINK bed files

This is possibly the best option if you prefer to work directly with files instead of binaries, and probably your only option if you want to add or remove samples and make certain changes.

Write a convertf parameter file for each dataset with the following settings. The example is for the Minoan and Mycenaean data, and I named it convertfMinMyc:

genotypename: MinMyc.geno
snpname: MinMyc.snp
indivname: MinMyc.ind
outputformat: PACKEDPED
genotypeoutname: MinMyc.bed
snpoutname: MinMyc.bim
indivoutname: MinMyc.fam

Write similar files for all datasets (or data) that you want to use.

Now do:

convertf -p convertfMinMyc
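Since the parameter files only differ in the dataset name, a small loop can write and run them all. This is a sketch, not a fixed recipe: write_convertf_par is a hypothetical helper, the files convertfScythianSarmatian and convertfHumanOriginsPublic2068 simply follow the naming used for convertfMinMyc, and convertf is only called when it is installed and the input exists.

```shell
# Hypothetical helper: write a convertf parameter file that converts
# PREFIX.geno/.snp/.ind into PLINK PREFIX.bed/.bim/.fam.
write_convertf_par() {
    cat > "convertf$1" <<EOF
genotypename: $1.geno
snpname: $1.snp
indivname: $1.ind
outputformat: PACKEDPED
genotypeoutname: $1.bed
snpoutname: $1.bim
indivoutname: $1.fam
EOF
}

for name in MinMyc ScythianSarmatian HumanOriginsPublic2068; do
    write_convertf_par "$name"
    # Convert only if convertf is installed and the input is present.
    if command -v convertf >/dev/null 2>&1 && [ -f "$name.geno" ]; then
        convertf -p "convertf$name"
    fi
done
```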

You can write all conversion jobs into a file (say, convertfBED.slurm) to run with Slurm:
sbatch convertfBED.slurm
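A file such as convertfBED.slurm could look roughly like the following. Treat it as a sketch: the job name, time limit, memory and log file name are placeholder values to adapt to your cluster, run_conversions is a hypothetical helper, and the parameter file names other than convertfMinMyc are assumed to follow the same naming pattern.

```shell
#!/bin/bash
#SBATCH --job-name=convertfBED
#SBATCH --time=01:00:00
#SBATCH --mem=8G
#SBATCH --output=convertfBED.%j.log

# Run convertf for each parameter file given; skip files that do not exist.
run_conversions() {
    for par in "$@"; do
        if [ -f "$par" ]; then
            convertf -p "$par"
        else
            echo "skipping missing parameter file: $par" >&2
        fi
    done
}

run_conversions convertfMinMyc convertfScythianSarmatian convertfHumanOriginsPublic2068
```

The conversions run one after another inside a single job, which keeps the log in one place.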

Now you have .bed, .bim and .fam files for your datasets, and you have to use PLINK to merge them.

If you are using Windows, like me, you might want to copy or move them to your Windows machine (for example, using your shared folder).

Write a text file (I named it all_my_files.txt) listing only the “secondary datasets”, not the first one, which you will select as your “main dataset” (MinMyc in this case):

HumanOriginsPublic2068.bed HumanOriginsPublic2068.bim HumanOriginsPublic2068.fam
ScythianSarmatian.bed ScythianSarmatian.bim ScythianSarmatian.fam

Then do the following in a command prompt, with all files (and MinMyc) in the same folder:

plink1.9 --bfile MinMyc --merge-list all_my_files.txt --indiv-sort 0 --make-bed --out MyMerged

NOTE. The flag --indiv-sort 0 is essential if you want to work with labels easily after working with the datasets – and you certainly want to do that.

Depending on your datasets, you will probably need to add --allow-no-sex so that ambiguous samples are left for analysis. Adding HumanOriginsPublic2068 certainly needs that flag, i.e.

plink1.9 --bfile MinMyc --merge-list all_my_files.txt --allow-no-sex --indiv-sort 0 --make-bed --out MyMerged
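The merge list used in the commands above can also be generated from the dataset prefixes. A minimal sketch, with write_merge_list as a hypothetical helper name:

```shell
# Hypothetical helper: write a PLINK merge list, one "bed bim fam" line per dataset.
# Usage: write_merge_list LISTFILE PREFIX...
write_merge_list() {
    list=$1
    shift
    : > "$list"    # start from an empty file
    for prefix in "$@"; do
        echo "$prefix.bed $prefix.bim $prefix.fam" >> "$list"
    done
}

# Secondary datasets only; MinMyc is passed separately with --bfile.
write_merge_list all_my_files.txt HumanOriginsPublic2068 ScythianSarmatian
```

This is mostly useful once the list grows beyond two or three datasets.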

NOTE. If using an older version (i.e. PLINK), an “Out of memory” message is likely to pop up, and you probably need to merge datasets one by one, and maybe even split certain datasets:
plink --noweb --bfile MinMyc --merge-list all_my_files.txt --make-bed --out MyMerged
plink --noweb --bfile MyMerged --merge-list all_my_files.txt --indiv-sort 0 --make-bed --out MyMerged2

etc.
(Remember that on Windows, for this example to work, plink.exe has to be in the same folder as the data, or you have to call it with its full path from a different directory. If you are using plink2.exe, you need to call plink2 instead.)

At the end of the process, you will have received some error messages regarding ambiguous samples (neither clearly male nor female), which are excluded. You can follow the instructions to include them, but for the sake of this example, let's leave them excluded.
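To get an idea of how many samples are affected, you can count the entries in the .fam file whose sex code (fifth column) is 0, which is how PLINK marks unknown sex. This is a sketch that assumes the ambiguous samples are the ones coded 0; count_unknown_sex is a hypothetical helper name.

```shell
# Count samples whose sex code (column 5 of a .fam file) is 0, i.e. unknown.
count_unknown_sex() {
    awk '$5 == 0 { n++ } END { print n + 0 }' "$1"
}

if [ -f MyMerged.fam ]; then
    echo "samples with unknown sex: $(count_unknown_sex MyMerged.fam)"
fi
```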

Remove samples

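To exclude certain samples (in my case, I removed Chimp and hg19ref), you need a text file with one sample per line, following the .fam guidelines. PLINK reads the first two columns as family ID and individual ID, so check them against the entries in your own .fam file: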
Chimp.REF M Chimp
Href.REF M hg19ref

I named it MyMergedRemove. Then I did:
plink1.9 --bfile MyMerged --remove MyMergedRemove --make-bed --out MyMerged2
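Because --remove matches on the first two columns (family ID and individual ID), one way to avoid typing mistakes is to copy those columns straight from the .fam file. The sketch below assumes the sample names appear in the second column of MyMerged.fam; make_remove_list is a hypothetical helper name.

```shell
# Hypothetical helper: print "FID IID" for every .fam line whose IID is listed.
# Usage: make_remove_list FAMFILE IID...
make_remove_list() {
    fam=$1
    shift
    for id in "$@"; do
        # Appending "" forces a string comparison of the IDs.
        awk -v id="$id" '($2 "") == (id "") { print $1, $2 }' "$fam"
    done
}

if [ -f MyMerged.fam ]; then
    make_remove_list MyMerged.fam Chimp.REF Href.REF > MyMergedRemove
fi
```

If the resulting file is empty, the IDs are probably written differently in your .fam file.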

Convert from BED to PED

To convert from BED to PED:

plink1.9 --bfile MyMerged --recode --allow-no-sex --out MyMergedPED

PED files tend to grow quite large with merge operations, so you can clean them using minor allele frequency:

plink1.9 --bfile MyMerged --maf 0.05 --recode --out MyMergedPED

You can also clean with HWE and remove mostly empty individuals (PLINK 1.9 expects a p-value threshold after --hwe; 0.001 is used here only as an example value):

plink1.9 --file MyMergedPED --hwe 0.001 --mind 0.9 --recode --out MyMergedPED_clean

The --mind 0.9 option removes individuals with more than 90% missing genotypes.

PED files can then be used for PCA with Eigensoft and for ADMIXTURE analysis.
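To check what the filters did, you can compare the number of samples (lines in the .ped file) and variants (lines in the .map file) before and after cleaning. A minimal sketch, with ped_size as a hypothetical helper name:

```shell
# Hypothetical helper: print the number of samples and variants in a PED/MAP pair.
# A .ped file has one line per sample; a .map file has one line per variant.
ped_size() {
    samples=$(wc -l < "$1.ped")
    variants=$(wc -l < "$1.map")
    echo "$1: $((samples)) samples, $((variants)) variants"
}

for prefix in MyMergedPED MyMergedPED_clean; do
    if [ -f "$prefix.ped" ] && [ -f "$prefix.map" ]; then
        ped_size "$prefix"
    fi
done
```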
