Posted by leomchugh on 11/01/10


Tagged: learning Pinnacle Machine Harvest



PINNACLE - Automated Machine Learning Module for the Harvest peptide identification package


Published in: C++
 

Harvest paper: McHugh L, Arthur JW (2008) Computational methods for protein identification from mass spectrometry data. PLoS Comput Biol 4(2): e12. doi:10.1371/journal.pcbi.0040012

PINNACLE automatically learns the statistical and neural network models for the Harvest algorithm.

Contact [email protected] for a Visual Studio project solution.

  1. /* Learner.cpp :
  2. Copyright (c) 2009, Leo McHugh, University of Sydney, Australia
  3. All rights reserved.
  4.  
  5. Redistribution and use in source and binary forms, with or without
  6. modification, are permitted provided that the following conditions are met:
  7.   * Redistributions of source code must retain the above copyright
  8.   notice, this list of conditions and the following disclaimer.
  9.   * Redistributions in binary form must reproduce the above copyright
  10.   notice, this list of conditions and the following disclaimer in the
  11.   documentation and/or other materials provided with the distribution.
  12.   * Neither the name of the <organization> nor the
  13.   names of its contributors may be used to endorse or promote products
  14.   derived from this software without specific prior written permission.
  15.  
  16. THIS SOFTWARE IS PROVIDED BY LEO MCHUGH ''AS IS'' AND ANY
  17. EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
  18. WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
  19. DISCLAIMED. IN NO EVENT SHALL LEO MCHUGH BE LIABLE FOR ANY
  20. DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
  21. (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
  22. LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
  23. ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  24. (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
  25. SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  26.  
  27. This code requires input files from PreProcess.cpp
  28.  
  29. updates of this code can be found at: http://www.usyd.edu.au/sydneybioinformatics/research/software.shtml
  30.  
  31.  
  32. */
  33.  
  34. /*
  35. PURPOSE AND USE OF THIS SOFTWARE:
  36.  
  37. This software is designed for use with PINNACLE. Once PINNACLE is run, it generates a list of hits along with a
  38. confidence score for each, and prints out a list of identified fragments. An example is shown as a commented out
  39. section at the very end of this file.
  40. Learner.cpp uses these files to gather statistical information about fragmentation and to build and train a
  41. neural network.
  42.  
  43. For version 1.0, only limited information will be collected.
  44.  
  45. We calculate approximate probability functions for:
  46. the parent mass error w.r.t. mass
  47. the fragment mass error w.r.t. mass
  48. the probability of fragment identification in correct assignments w.r.t. mass / charge
  49.  
  50. Neural Nets:
  51.  
  52. Inputs for predicting peak intensity for a fragment:
  53. iontype either no frag, immonium, y, a or b ion (nofrag, immo, y1, a1, b1) = (0,1,3,2,4); // 5 reserved for z ions but not implemented
  54. pcharge parent charge
  55. fcharge fragment charge
  56. plen the length of the parent peptide
  57. prop_len the proportion of the distance from N to C term that the fragmentation site was at
  58. pr_hkr proportion of H, K and R residues in the fragment over those in the unbroken peptide
  59. n_x identity of AA on the n-term side of cleavage
  60. c_x identity of AA on the c-term side of cleavage
  61.  
  62. inputs: iontype, pcharge, fcharge, plen, prop_len, pr_hkr, n_x, c_x
  63.  
  64. (taken from working notes07)
  65.  
  66. ==== Add later============
  67. loss 0 if it's a 'native' ion, such as a y or b, 1 if it contains losses.
  68.  
  69. Output:
  70. log of the intensity, scaled so that the max log(intensity) for the fragment is set to 100 // maybe use later 6 Fuzzy Logic NN outputs, 0, 20, 40 etc to 100.
  71.  
  72.  
  73. */
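The eight inputs listed above can be sketched as a small standalone encoder. This is a hypothetical illustration only: `makeInputs` and `propHKR` are not part of the listing, and the real `getAttributes()`/`encodeAA()` may scale or encode the values differently.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical pr_hkr: H/K/R residues in the fragment over those in the
// whole peptide (the real countHKR() may differ).
double propHKR(const std::string& frag, const std::string& pep) {
    int inFrag = 0, inPep = 0;
    for (char c : frag) if (c == 'H' || c == 'K' || c == 'R') inFrag++;
    for (char c : pep)  if (c == 'H' || c == 'K' || c == 'R') inPep++;
    return inPep == 0 ? 0.0 : double(inFrag) / inPep;
}

// iontype uses the codes listed above: (nofrag, immo, y1, a1, b1) = (0,1,3,2,4)
std::vector<double> makeInputs(int iontype, int pcharge, int fcharge,
                               const std::string& pep, unsigned fragSite) {
    std::vector<double> in;
    in.push_back(iontype);                               // iontype
    in.push_back(pcharge);                               // pcharge
    in.push_back(fcharge);                               // fcharge
    in.push_back(double(pep.size()));                    // plen
    in.push_back(double(fragSite) / pep.size());         // prop_len
    in.push_back(propHKR(pep.substr(0, fragSite), pep)); // pr_hkr (N-terminal side)
    // n_x / c_x: residues flanking the cleavage site, mapped into [0,1]
    in.push_back(fragSite > 0 ? (pep[fragSite - 1] - 'A') / 25.0 : 0.0);
    in.push_back(fragSite < pep.size() ? (pep[fragSite] - 'A') / 25.0 : 0.0);
    return in;
}
```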
  74.  
  75. #include "stdafx.h"
  76. #include <iostream>
  77. #include <string>
  78. #include <fstream>
  79. #include <iostream>
  80. #include <sstream>
  81. #include <vector>
  82. #include <list>
  83. #include <time.h> // for random number generation
  84.  
  85. // Learner based libraries:
  86. #include "../Utilities/Vector.h"
  87. #include "../Utilities/Matrix.h"
  88. #include "../Utilities/InputTargetDataSet.h"
  89. #include "../MultilayerPerceptron/MultilayerPerceptron.h"
  90. #include "../ObjectiveFunctional/MeanSquaredError.h"
  91. #include "../TrainingAlgorithm/QuasiNewtonMethod.h"
  92.  
  93.  
  94. using namespace std;
  95.  
  96. // ======================= FIXED PARAMETERS =====================
  97. unsigned const MAX_SEQ_SIZE = 50; // make sure this is the same as in preprocess.cpp and PINNACLE.cpp
  98. unsigned const MIN_SEQ_SIZE = 5; // make sure this is the same as in preprocess.cpp and PINNACLE.cpp
  99. unsigned outputNodes = 1; // The number of output nodes in the fuzzy network
  100. unsigned inputNodes = 8; // described above
  101. double propTesting = 0.2; // Proportion of peptides reserved for testing.
  102. double minZscore = 3.7; // minimum required Zscore for a 'hit' to be considered confident enough to be used for testing/training
  103. unsigned const buckets = 50; // divide the mass spectrum into this many 100 Da buckets for the statistical learning part
  104. bool fuzzyOutputs=false; // if true, splits the output to a fuzzy output, if false, there is a single output node.
  105.  
  106. // ======================= OBJECTS ==============================
  107. // ============== MUST BE THE SAME AS IN PINNACLE.CPP ===========
  108. struct fragmatch { // records a matching fragment from an identified peptide
  109. unsigned intensity;
  110. double obsmz;
  111. double expmz; // expected mz
  112. int fcharge; // fragment charge
  113. string fragType; // e.g. y1+
  114. string fragDesc; // description of the fragment, e.g. ........NQLLQFR
  115. unsigned fragSite; // from N to C term. zero is no frag, 1 is first b ion / last y ion, etc.
  116. double fragLOD; // LOD for this frag
  117. };
  118. struct format // converts a hit into a neural net input set
  119. {
  120. double pmasserror; // error in the pmass calculation
  121. vector<vector<double>> fmzerror; // fragment mass error array <fmass,error>
  122. vector<vector<double>> NNlines; // formatted lines to be appended to the NN training matrix.
  123. };
  124.  
  125. struct hit {
  126. string seq; // sequence of the matched peptide
  127. double Zscore; // number of std devs above the random mean
  128. int pcharge; // charge of the observed parent peptide
  129. int basichits; // the number of no loss y and b ions
  130. double prelimScore; // the proportion of the top 4N peaks covered by no loss y and b ions.
  131. double obspmass; // observed pmass of the input spectrum (uncharged)
  132. double exppmass; // theoretical pmass of the matched peptide (uncharged)
  133. vector<fragmatch> peaks; // contains the top 4N peaks
  134. vector<double> mods; // one entry for each AA from N to C term
  135. double overallLOD; // LOD score generated by PINNACLE for this peptide match
  136. };
  137. struct learnedParameters { // holds the parameters learned by this file for passing to PINNACLE
  138. // THIS STRUCTURE PASSES ANY LEARNED PARAMETERS TO PINNACLE
  139. // IF YOU WANT TO INCLUDE MORE STATISTICAL LEARNING, ADD THE LEARNED PARAMETERS
  140. // TO THIS OBJECT, AND UPDATE PINNACLE SO THAT IT CAN LOAD THE SAME OBJECT
  141. // THIS OBJECT IS WRITTEN TO BINARY FILE AFTER LEARNING SO THAT PINNACLE CAN READ IT IN
  142. double pmassErrors[buckets][2]; // mean and std dev for every bucket (100 Da) for parent mass errors
  143. double fmassErrors[buckets][2]; // mean and std dev for every bucket (100 Da) for fragment mass errors
  144.  
  145. // FEEL FREE TO ADD MORE LEARNED PARAMETERS IN HERE (REMEMBERING TO UPDATE THE SAME OBJECT IN PINNACLE)
  146.  
  147. };
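Since learnMassErrors() below dumps this struct byte-for-byte with ofstream::write, PINNACLE can recover it with a matching ifstream::read, provided the struct layout matches on both sides. A minimal round-trip sketch (saveParams/loadParams are hypothetical helper names, not from the listing):

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>

// Mirrors the binary dump at the end of learnMassErrors(); the struct
// layout must match exactly between the Learner and PINNACLE builds.
const unsigned buckets = 50;
struct learnedParameters {
    double pmassErrors[buckets][2]; // {mean, stdDev} per 100 Da bucket
    double fmassErrors[buckets][2];
};

// Write side, as done at the end of learnMassErrors().
void saveParams(const char* path, const learnedParameters& lp) {
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(&lp), sizeof(lp));
}

// Matching read side, as PINNACLE would load it back in.
bool loadParams(const char* path, learnedParameters& lp) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    in.read(reinterpret_cast<char*>(&lp), sizeof(lp));
    return static_cast<std::size_t>(in.gcount()) == sizeof(lp);
}
```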
  148. // =================================== FILE LOCATIONS ================================================================
  149. // location of the output of the PINNACLE run (this is a serial file written by PINNACLE storing all the hits as objects)
  150. char *inputFile = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\PINNACLE\\Release\\output.ser"; // .ser (serial) - see PINNACLE for format details
  151.  
  152. // the below files are all outputs from this program. savedModel and learnedParameters store the neural network and
  153. // other general learned parameters respectively and these 2 files must be written to the working directory of PINNACLE
  154.  
  155. // neural network related
  156. char *tempTrainingfile = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\Learner\\Debug\\tempTrainingSet.dat"; // just contains instances
  157. char *trainingfile = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\Learner\\Debug\\trainingSet.dat";
  158. char *testingfile = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\Learner\\Debug\\testingSet.dat";
  159. char *trainingHistory = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\Learner\\Debug\\trainingHistory.dat";
  160. char *savedModel = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\PINNACLE\\Debug\\multilayerPerceptron.dat";
  161.  
  162. // other learned parameters (see object for details)
  163. char *learnedParams = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\PINNACLE\\Release\\learnedParameters.bin"; // binary
  164.  
  165. // ===================================== FUNCTIONS ===========================
  166. // LOAD IN AND DATA PREPARATION
  167. void createTrainingFile();
  168. hit loadInHit(ifstream&); // loads in the next hit
  169. vector<string> processHit(hit, bool); // returns a block of lines which are converted into a NN input file (Flood style)
  170. double getMean(vector<double>&); // returns the mean
  171. double getStdDev(vector<double>&, double); // returns the standard deviation of input array, second arg is the precomputed vector mean.
  172. void learnMassErrors(); // learns parent and fragment mass errors for this dataset
  173.  
  174. // TESTING OF THE MODEL
  175. void loadAndProcessTrainingFile();
  176. void testModel(); // compares against a flat spectrum.
  177. double dotProduct(vector<double>, vector<double>); // return inner dot product of input vectors
  178. double testPeptide(vector<double>); // returns the dot product of a flat and the predicted spect.
  179. fragmatch findMatch(string, unsigned, hit); // (iontype, breakSite,matchArray)
  180. vector<double> fuzzyOut(float, float); // produces a fuzzy output of the intensity
  181. void trainingSetHeader(ofstream&,int,int,int,string,string);// sets up the header for the output training file
  182.  
  183. // OTHER HANDY FUNCTIONS
  184. vector<double> normalise(vector<double>, double sumTotal); // takes an input vector and normalises to sum to 100
  185. vector<double> convert2mods(string); //
  186. vector<double> convert2doubleVect(string); // converts a space separated line into a vector of doubles (from test set)
  187. vector<fragmatch> convert2peaks(string); // used during loading in to convert .ser file to a hit object
  188. double convert2double(string); // converts a string to a double
  189. int convert2int(string); // converts a string to an int
  190. string convert2str(vector<double>); // converts doubles into a string for output
  191. double encodeAA(char);
  192. double countHKR(string&);
  193. string stringify(double x); // convert double to string
  194.  
  195. // Neural network
  196. Flood::Vector<double> getAttributes(fragmatch& fm,string seq, unsigned pcharge);
  197. Flood::MultilayerPerceptron ANN;
  198. void printFloodVector(Flood::Vector<double>& v);
  199.  
  200.  
  201. // ============== DEBUGGING ===========================
  202. int verbosity = 4;
  203.  
  204. int _tmain(int argc, _TCHAR* argv[])
  205. {
  206. // NEURAL NETWORK LEARNING
  207. // reads the identified peptides and creates a NN training file
  208. // createTrainingFile();
  209.  
  210. // imports the processed training file, sets the neural net parameters for the model and learning
  211. // loadAndProcessTrainingFile();
  212.  
  213. // tests neural network model against the testing file generated by createTrainingFile.
  214. testModel();
  215.  
  216. // STATISTICAL MACHINE LEARNING
  217. // collects basic stats on hits to optimise search parameters.
  218. // learnMassErrors(); // collects stats on parent and fragment mass errors.
  219.  
  220.  
  221. return 0;
  222. }
  223. void learnMassErrors() { // learn the mass errors from the hits
  224. cout << "Learning from parent and fragment mass errors from: " << inputFile << endl;
  225. ifstream inputHits;
  226. inputHits.open(inputFile); // load in the input file in binary format
  227.  
  228. learnedParameters lp; // object for storing learned parameters
  229.  
  230. // we divide the mass range up into buckets of 100 Da, storing a mean and variance for each bucket
  231. // PINNACLE will then use this information in a prob density function
  232. vector<vector<double>> pmassbuckets; // stores the parent mass errors for each bin
  233. vector<vector<double>> fmassbuckets; // stores the fragment mass errors for each bin
  234. for (unsigned i=0; i < buckets; i++) {
  235. vector<double> v;
  236. pmassbuckets.push_back(v); // create elements (blank memory spacers for the time being)
  237. fmassbuckets.push_back(v); // create elements
  238. }
  239.  
  240. unsigned hitsLoaded=0;
  241. while (!inputHits.eof()) {
  242. hit h = loadInHit(inputHits);
  243. if (h.seq.compare("END")==0) {cout << "hits loaded: " << hitsLoaded << endl; break;} // no more available (loadInHit returns 'END' at eof)
  244. if (h.Zscore < minZscore) {continue;} // not confident enough hit to use for learning
  245. hitsLoaded++;
  246.  
  247. // RECORD PARENT MASS ERROR
  248. unsigned index = (unsigned) floor(h.obspmass/100); // assign bucket (100 Da per bucket)
  249. if (index > buckets-1) {index = buckets-1;} // prevent overflow
  250. //double modmass =0; // account for mods
  251. //for (unsigned i=0; i < h.mods.size(); i++) {modmass+=h.mods.at(i);}
  252. double difference = h.obspmass-(h.exppmass);
  253. if (fabs(difference) > 1) {
  254. cout << "obs mass: " << h.obspmass << " expmass: " << h.exppmass << endl; // " modmass: " << modmass << endl;
  255. cout << "unexpectedly high parent mass error - skip"; continue; // sometimes a peptide is IDed twice, each with different mods, keep searching..
  256. }
  257. pmassbuckets.at(index).push_back(difference); // record parent mass error
  258.  
  259. // RECORD FRAGMENT MASS ERRORS
  260. for (unsigned i=0; i<h.peaks.size(); i++) {
  261. if (h.peaks.at(i).obsmz == 0) {continue;} // no matching peak
  262. unsigned index = (unsigned) floor(h.peaks.at(i).obsmz/100); // assign bucket (100 Da per bucket)
  263. if (index > buckets-1) {index = buckets-1;} // prevent overflow
  264. // using absolute frag mass error values because bimodal
  265. // distributions were found for some datasets. signed values would be
  266. // better (but mean and std dev couldn't then be used in the probability functions)
  267. //cout << " fmass: " << h.peaks.at(i).obsmz << " error: " << h.peaks.at(i).obsmz-h.peaks.at(i).expmz << endl;
  268. fmassbuckets.at(index).push_back(fabs(h.peaks.at(i).obsmz-h.peaks.at(i).expmz)); // record fragment mass error
  269. //if (fragError.at(1) < 0 && h.peaks.at(i).obsmz > 700) {cout << "UNDER: " << h.peaks.at(i).fragType << " " << h.peaks.at(i).fragDesc << ' ' << h.peaks.at(i).obsmz-h.peaks.at(i).expmz << endl;}
  270. //if (fragError.at(1) > 0 && h.peaks.at(i).obsmz > 700) {cout << " OVER: " << h.peaks.at(i).fragType << " " << h.peaks.at(i).fragDesc << ' ' << h.peaks.at(i).obsmz-h.peaks.at(i).expmz << endl;}
  271. }
  272. } // end loading in all hits from file
  273. inputHits.close();
  274.  
  275. // SHOW SOME OF THE DISTROS IN THE BUCKETS
  276. unsigned testBucket = 8;
  277. cout << "bucket" << testBucket << "={";
  278. for (unsigned i=0; i < fmassbuckets.at(testBucket).size(); i++) {cout << fmassbuckets.at(testBucket).at(i) << ',';}
  279. cout << "};\n";
  280.  
  281. // exit(0); // debug exit - if left in, the parameter estimation below never runs
  282.  
  283. // GET PARAMETERS FOR MEAN AND VARIANCE FOR EACH BUCKET
  284. unsigned minDatapoints = 10; // the minimum number of values in a bucket for the calculation of mean and std dev
  285.  
  286. // ================ PARENT MASS ERRORS ============================================================
  287. vector<vector<double>> pe; // pe = pmass errors for each bucket, stores <mean, stdDev>
  288. for (unsigned bucket=0; bucket < buckets; bucket++) {
  289. vector<double> v;
  290. v.clear();
  291. if (pmassbuckets.at(bucket).size() < minDatapoints) {pe.push_back(v); continue;} // too few values to calculate
  292. double mean = getMean(pmassbuckets.at(bucket));
  293. double stdDev = getStdDev(pmassbuckets.at(bucket), mean);
  294. //cout << "massbucket: " << bucket << ": {";
  295. //for (unsigned j=0; j<pmassbuckets.at(bucket).size(); j++) {cout << pmassbuckets.at(bucket).at(j) << ',';} cout << "}" << endl;
  296. //cout << "bucket: " << bucket << " mean: " << mean << " stdDev: " << stdDev << endl;
  297. v.push_back(mean); v.push_back(stdDev);
  298. pe.push_back(v);
  299. }
  300.  
  301. // if there are not enough values in some buckets, they will be empty. In this case we just give them the values of their neighbours.
  302. // first back-propagate the values to leading empty buckets
  303. for (unsigned i=0; i < pe.size(); i++) { // pe = parent errors
  304. if (pe.at(i).empty()) {continue;}
  305. vector<double> v = pe.at(i); // first non-empty bucket
  306. while (1) {pe.at(i) = v; if (i==0) {break;} i--;}
  307. break;
  308. }
  309.  
  310. // now propagate forward any values to any empty buckets
  311. vector<double> v = pe.at(0); // first bucket's value (mean,stdDev)
  312. for (unsigned i=0; i < pe.size(); i++) { // pe = parent errors
  313. if (pe.at(i).empty()) {pe.at(i)=v; continue;} // assign same values as bucket to the left
  314. v = pe.at(i); // update
  315. }
  316.  
  317. cout << "pmass errors: \n"; for (unsigned i=0; i < pe.size(); i++) {
  318. if (pe.at(i).empty()) {cout << "i: " << i << "{}\n";}
  319. else {cout << "bucket: " << i << " {" << pe.at(i).at(0) << ',' << pe.at(i).at(1) << "}\n";}
  320. } cout << endl;
  321.  
  322. // load into object
  323. for (unsigned i=0; i < buckets; i++) { // pe = parent errors
  324. lp.pmassErrors[i][0] = pe.at(i).at(0); // mean
  325. lp.pmassErrors[i][1] = pe.at(i).at(1); // standard Deviation
  326. }
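The comments above say PINNACLE uses each bucket's mean and standard deviation in a probability density function. Presumably that means evaluating the observed mass error under a normal distribution; a sketch of such a helper (massErrorDensity is a hypothetical name, not from the listing):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical scoring helper: density of an observed mass error under
// the bucket's learned normal distribution N(mean, stdDev^2).
double massErrorDensity(double error, double mean, double stdDev) {
    const double TWO_PI = 6.283185307179586;
    double z = (error - mean) / stdDev;
    return std::exp(-0.5 * z * z) / (stdDev * std::sqrt(TWO_PI));
}
```

PINNACLE would look up the bucket for the observed mass (index = floor(mass/100), capped at buckets-1, as in the loops above) and feed that bucket's lp values into the density.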
  327.  
  328.  
  329. // ================ FRAGMENT MASS ERRORS ============================================================
  330. vector<vector<double>> fe; // fe = fragment errors for each bucket, stores <mean, stdDev>
  331. for (unsigned bucket=0; bucket < buckets; bucket++) {
  332. vector<double> v;
  333. v.clear();
  334. if (fmassbuckets.at(bucket).size() < minDatapoints) {fe.push_back(v); continue;} // too few values to calculate
  335. double mean = getMean(fmassbuckets.at(bucket));
  336. double stdDev = getStdDev(fmassbuckets.at(bucket), mean);
  337. //cout << "fmassbucket: " << bucket << ": {";
  338. //for (unsigned j=0; j<fmassbuckets.at(bucket).size(); j++) {cout << fmassbuckets.at(bucket).at(j) << ',';} cout << "}" << endl;
  339. //cout << "bucket: " << bucket << " mean: " << mean << " stdDev: " << stdDev << endl;
  340. v.push_back(mean); v.push_back(stdDev);
  341. fe.push_back(v);
  342. }
  343.  
  344. // if there are not enough values in some buckets, they will be empty. In this case we just give them the values of their neighbours.
  345. // first back-propagate the values to leading empty buckets
  346. for (unsigned i=0; i < fe.size(); i++) { // fe = fragment errors
  347. if (fe.at(i).empty()) {continue;}
  348. vector<double> v = fe.at(i); // first non-empty bucket
  349. //cout << "first non-empty fragmass bucket: {" << fe.at(i).at(0) << ',' << fe.at(i).at(1) << "}" << endl;
  350. while (1) {fe.at(i) = v; if (i==0) {break;} i--;}
  351. break;
  352. }
  353.  
  354. // now propagate forward any values to any empty buckets
  355. v = fe.at(0); // first bucket's value (mean,stdDev)
  356. for (unsigned i=0; i < fe.size(); i++) { // fe = fragment errors
  357. if (fe.at(i).empty()) {fe.at(i)=v; continue;} // assign same values as bucket to the left
  358. v = fe.at(i); // update
  359. }
  360.  
  361. cout << "fragment errors: \n"; for (unsigned i=0; i < fe.size(); i++) {
  362. if (fe.at(i).empty()) {cout << "i: " << i << "{}\n";}
  363. else {cout << "bucket: " << i << " {" << fe.at(i).at(0) << ',' << fe.at(i).at(1) << "}\n";}
  364. } cout << endl;
  365.  
  366. // output for Mathematica graphing - buckets of 100 da {{fragmass,AvgError,StdDev},{...}}
  367. cout << "ferrors={"; for (unsigned i=0; i < fe.size(); i++) {
  368. if (fe.at(i).empty()) {cout << "i: " << i << "{}\n";}
  369. else {cout << '{' << i*100 << ',' << fe.at(i).at(0) << ',' << fe.at(i).at(1) << "},";}
  370. } cout << '}' << endl;
  371.  
  372.  
  373. // load into object
  374. for (unsigned i=0; i < buckets; i++) { // fe = fragment errors
  375. lp.fmassErrors[i][0] = fe.at(i).at(0); // mean
  376. lp.fmassErrors[i][1] = fe.at(i).at(1); // standard Deviation
  377. }
  378.  
  379.  
  380.  
  381. ofstream out; // write object to file
  382. out.open(learnedParams, ios::binary);
  383. out.write((char *)(&lp),sizeof(lp)); // write object out to binary file (used by PINNACLE)
  384. out.close();
  385.  
  386. cout << "quitting after learned parameters written to file";
  387. exit(0);
  388. }
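getMean and getStdDev are declared near the top of the file but their definitions fall outside this listing. Possible definitions consistent with how they are called here (the originals may use the sample n-1 standard deviation rather than the population form shown):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Mean of the input array, matching the declaration above.
double getMean(std::vector<double>& v) {
    double sum = 0;
    for (unsigned i = 0; i < v.size(); i++) { sum += v.at(i); }
    return sum / v.size();
}

// Standard deviation; second argument is the precomputed mean,
// as stated in the declaration above.
double getStdDev(std::vector<double>& v, double mean) {
    double ss = 0;
    for (unsigned i = 0; i < v.size(); i++) {
        double d = v.at(i) - mean;
        ss += d * d;
    }
    return std::sqrt(ss / v.size()); // population std dev
}
```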
  389. void testModel() {
  390. // first load in the model
  391. Flood::MultilayerPerceptron ANN; ANN.load(savedModel);
  392. cout << "loading ANN from " << savedModel << endl;
  393.  
  394. ifstream testingSet;
  395. testingSet.open(testingfile);
  396. vector<double> improvements;
  397.  
  398. // load in a single instance from the test set
  399. string line;
  400.  
  401. while (!testingSet.eof()) {
  402. vector<vector<double>> instance; // instance of the matching hits for a given peptide
  403. instance.clear();
  404. line = "NULL";
  405. while (!line.empty() && !testingSet.eof()) {
  406. getline(testingSet, line );
  407. if (!line.empty()) {
  408. //cout << " pushing back line: " << line << " of size: " << convert2doubleVect(line).size() << endl;
  409. instance.push_back(convert2doubleVect(line));
  410. }
  411. }
  412. //cout << "instance loaded with " << instance.size() << endl;
  413.  
  414. vector<double> predicted;
  415. vector<double> flat(instance.size(),1);
  416. vector<double> observed;
  417.  
  418. for (unsigned i=0; i < instance.size(); i++) {
  419. vector<double> input = instance.at(i);
  420. double obsPeakInt = input.back(); // the correct output
  421. observed.push_back(obsPeakInt);
  422. input.pop_back(); // remove the last element (answer)
  423. // need to convert a normal vector to a flood style vector
  424. Flood::Vector<double> attributes(input.size());
  425. for (unsigned j=0; j<input.size(); j++) {attributes[j] = input.at(j);}
  426.  
  427.  
  428. Flood::Vector<double> fuzzyOutput = ANN.getOutput(attributes);
  429.  
  430. if (fuzzyOutputs) {
  431. // take the highest node as the prediction of intensity (rounded to nearest 20%)
  432. unsigned maxnode=0; double maxval=0;
  433. for (int k=0; k < fuzzyOutput.getSize(); k++) {
  434. if (fuzzyOutput[k] > maxval) {maxnode=k; maxval=fuzzyOutput[k];}
  435. }
  436. predicted.push_back(maxnode); // just take the node with the highest output
  437. }
  438. else {
  439. //cout << "attributes: "; for (int k=0; k < attributes.getSize(); k++) {cout << attributes[k] << ' ';}
  440. //cout << "predicted output: " << fuzzyOutput[0] << endl;
  441. predicted.push_back(fuzzyOutput[0]); // just the raw single number coming out.
  442. //double exponent = pow(2.17,fuzzyOutput[0]);
  443. //predicted.push_back(exponent); // blow it back up
  444. }
  445.  
  446.  
  447. // print out the fuzzy nodes, max node and observed intensity
  448. /*cout << "fuzzyOut " << i << ": ";
  449. for (int k=0; k < fuzzyOutput.getSize(); k++) {
  450. char buffer [50];
  451. int n, a=5, b=3;
  452. n=sprintf_s(buffer, 50, "%0.3f",fuzzyOutput[k]);
  453. cout << buffer << " ";
  454. }cout << "| " << maxnode*0.2 << " | " << obsPeakInt << endl;
  455. */
  456. } // end prediction of each peak height
  457.  
  458. // end of single test peptide
  459. // Normalise vectors
  460. predicted = normalise(predicted,100); // normalises spectrum to a sum total of 100
  461. flat = normalise(flat,100);
  462. observed = normalise(observed,100);
  463.  
  464. // show normalized vectors
  465. cout << "flat : {"; for (unsigned i=0; i < flat.size()-1; i++) {cout << flat.at(i) << ',';} cout << flat.back() << "}\n";
  466. cout << "predicted: {"; for (unsigned i=0; i < predicted.size()-1; i++) {cout << predicted.at(i) << ',';} cout << predicted.back() << "}\n";
  467. cout << "observed : {"; for (unsigned i=0; i < observed.size()-1; i++) {cout << observed.at(i) << ',';} cout << observed.back() << "}\n";
  468.  
  469. double flatDP = dotProduct(observed,flat);
  470. double annDP = dotProduct(observed,predicted);
  471. improvements.push_back(annDP/flatDP);
  472.  
  473. cout << "imp: " << annDP/flatDP << " flatDP: " << flatDP << " annDP: " << annDP << endl << endl;
  474.  
  475. //exit(0);
  476.  
  477. }
  478.  
  479.  
  480. cout << "improvements={"; for (unsigned i=0; i < improvements.size(); i++) {cout << improvements.at(i) << ',';} cout << "};";
  481.  
  482. // use a dot product model.
  483. // for the FOUND peaks, create a spectrum. then normalise
  484.  
  485.  
  486.  
  487. return;
  488. }
  489. double dotProduct(vector<double> v1, vector<double> v2) {
  490. double dp=0;
  491. if (v1.size() != v2.size()) {cout << "vectors not equal length!"; exit(1);}
  492. for (unsigned i=0; i < v1.size(); i++) {
  493. dp += v1.at(i)*v2.at(i);
  494. }
  495. return dp;
  496. }
  497.  
  498. vector<double> normalise(vector<double> v, double sumTotal) {
  499. double sum = 0;
  500. for (unsigned i=0; i < v.size(); i++) {sum+=v.at(i);}
  501. // make sum to 100
  502. double scalar = sumTotal/sum;
  503. for (unsigned i=0; i < v.size(); i++) {v.at(i)*=scalar;}
  504. return v;
  505.  
  506. }
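normalise and dotProduct together produce the annDP/flatDP "improvement" ratio printed in testModel. A self-contained usage sketch (improvement is a hypothetical wrapper; the two helpers are standalone copies of the functions above):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Copy of normalise() above: scale the vector to sum to sumTotal.
std::vector<double> normalise(std::vector<double> v, double sumTotal) {
    double sum = 0;
    for (unsigned i = 0; i < v.size(); i++) { sum += v.at(i); }
    double scalar = sumTotal / sum;
    for (unsigned i = 0; i < v.size(); i++) { v.at(i) *= scalar; }
    return v;
}

// Copy of dotProduct() above (length check omitted for brevity).
double dotProduct(std::vector<double> v1, std::vector<double> v2) {
    double dp = 0;
    for (unsigned i = 0; i < v1.size(); i++) { dp += v1.at(i) * v2.at(i); }
    return dp;
}

// Ratio > 1 means the predicted spectrum matches the observed one
// better than a flat (uninformative) spectrum does.
double improvement(std::vector<double> observed, std::vector<double> predicted) {
    std::vector<double> flat(observed.size(), 1.0);
    observed  = normalise(observed, 100);
    predicted = normalise(predicted, 100);
    flat      = normalise(flat, 100);
    return dotProduct(observed, predicted) / dotProduct(observed, flat);
}
```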
  507. double testPeptide(vector<double> fragmatches) { // returns a dot product for the predicted and a flat spect.
  508.  
  509. return 0;
  510. }
  511. vector<double> convert2doubleVect (string line) { // converts a space-separated line into a vector of doubles
  512. unsigned i=0; unsigned j=0;
  513. vector<double> out;
  514. while (j < line.size()) {
  515. if (j == line.size()-1) { // last one
  516. string tmp = line.substr(i,j-i+1); //cout << "element: " << tmp << endl;
  517. double val = convert2double(tmp);
  518. out.push_back(val);
  519. }
  520. if (line.at(j) != ' ' && j < line.size()) {j++; continue; }
  521. string tmp = line.substr(i,j-i); //cout << "element: " << tmp << endl;
  522. double val = convert2double(tmp);
  523. out.push_back(val);
  524. i=j; j++;
  525. }
  526. //exit(0);
  527. return out;
  528.  
  529. }
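For reference, an istringstream-based version of convert2doubleVect avoids the manual index bookkeeping: stream extraction handles repeated spaces and the final token without special cases. (parseDoubleLine is a hypothetical alternative name; behaviorally equivalent for well-formed input.)

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Same contract as convert2doubleVect above: split a space-separated
// line of numbers into a vector of doubles.
std::vector<double> parseDoubleLine(const std::string& line) {
    std::vector<double> out;
    std::istringstream iss(line);
    double val;
    while (iss >> val) { out.push_back(val); }
    return out;
}
```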
  530. void loadAndProcessTrainingFile() {
  531.  
  532. // LOAD IN TRAINING SET INTO NN FOR LEARNING>
  533. Flood::InputTargetDataSet inputTargetDataSet ;
  534. inputTargetDataSet.load(trainingfile);
  535.  
  536. // set up a new neural network
  537. Flood::MultilayerPerceptron multilayerPerceptron(inputNodes,inputNodes,outputNodes); // input, middle layer, output - same number in middle layer as inputs
  538. multilayerPerceptron.initFreeParametersNormal(0.0,1.0); // initialise the network within the chosen range
  539.  
  540. // set up the names of the variables (both in and out)
  541. Flood::Vector<string> nameOfInputVariables = inputTargetDataSet.getNameOfInputVariables();
  542. Flood::Vector<string> nameOfTargetVariables = inputTargetDataSet.getNameOfTargetVariables();
  543. multilayerPerceptron.setNameOfInputVariables(nameOfInputVariables);
  544. multilayerPerceptron.setNameOfOutputVariables(nameOfTargetVariables);
  545.  
  546.  
  547.  
  548.  
  549. // get the parameters for the input data - bound in necessary
  550. //Flood::Matrix<double> meanAndStandardDeviationOfInputData = inputTargetDataSet.getMeanAndStandardDeviationOfInputData();
  551. Flood::Matrix<double> minimumAndMaximumOfInputData = inputTargetDataSet.getMinimumAndMaximumOfInputData();
  552.  
  553. Flood::Vector<double> minOfInputVariables(inputNodes); // these values are not inclusive, so min must be lower than any expected input
  554. minOfInputVariables[0] = -0.01; // iontype
  555. minOfInputVariables[1] = 0.99; // pcharge
  556. minOfInputVariables[2] = 0.99; // fcharge
  557. minOfInputVariables[3] = MIN_SEQ_SIZE; // plen
  558. minOfInputVariables[4] = -0.01; // prop_len
  559. minOfInputVariables[5] = -0.01; // pr_hkr
  560. minOfInputVariables[6] = -0.01; // n_x
  561. minOfInputVariables[7] = -0.01; // c_x
  562. multilayerPerceptron.setMinimumOfInputVariables(minOfInputVariables);
  563.  
  564. Flood::Vector<double> maxOfInputVariables(inputNodes);
  565. maxOfInputVariables[0] = 4.01; // iontype
  566. maxOfInputVariables[1] = 3.01; // pcharge
  567. maxOfInputVariables[2] = 2.01; // fcharge
  568. maxOfInputVariables[3] = MAX_SEQ_SIZE; // plen
  569. maxOfInputVariables[4] = 1.01; // prop_len
  570. maxOfInputVariables[5] = 1.01; // pr_hkr
maxOfInputVariables[6] = 1.01; // n_x
maxOfInputVariables[7] = 1.01; // c_x
multilayerPerceptron.setMaximumOfInputVariables(maxOfInputVariables);


// set up the relevant max and min etc. for the output neuron(s)
Flood::Matrix<double> minimumAndMaximumOfTargetData = inputTargetDataSet.getMinimumAndMaximumOfTargetData();
cout << "min encountered output: " << minimumAndMaximumOfTargetData[0][0] << endl;
cout << "max encountered output: " << minimumAndMaximumOfTargetData[1][0] << endl;
double minBuffer = minimumAndMaximumOfTargetData[0][0] - minimumAndMaximumOfTargetData[0][0]/10; // min minus a 10% buffer for unexpectedly low results
double maxBuffer = minimumAndMaximumOfTargetData[1][0] + minimumAndMaximumOfTargetData[1][0]/10; // max plus a 10% buffer for unexpectedly high results
cout << "minBuffer: " << minBuffer << " maxBuffer: " << maxBuffer << endl;

Flood::Vector<double> minOfOutputVariables(outputNodes);
minOfOutputVariables[0] = minBuffer; // 10% buffer under the min value encountered in the training set
Flood::Vector<double> maxOfOutputVariables(outputNodes);
maxOfOutputVariables[0] = maxBuffer; // 10% buffer over the max value encountered in the training set
multilayerPerceptron.setMinimumOfOutputVariables(minOfOutputVariables);
multilayerPerceptron.setMaximumOfOutputVariables(maxOfOutputVariables);

multilayerPerceptron.setPreAndPostProcessingMethod(Flood::MultilayerPerceptron::MinimumAndMaximum);
multilayerPerceptron.print();

Flood::MeanSquaredError meanSquaredError(&multilayerPerceptron, &inputTargetDataSet);
//multilayerPerceptron.setPreAndPostProcessingMethod(Flood::MultilayerPerceptron::MeanAndStandardDeviation);
Flood::QuasiNewtonMethod quasiNewtonMethod(&meanSquaredError);
quasiNewtonMethod.setEvaluationGoal(0.0);
quasiNewtonMethod.setGradientNormGoal(0.0001);
quasiNewtonMethod.setMaximumTime(320.0); // seconds
quasiNewtonMethod.setMaximumNumberOfEpochs(20);
quasiNewtonMethod.train();

quasiNewtonMethod.saveTrainingHistory(trainingHistory);
multilayerPerceptron.save(savedModel);
cout << "neural network saved to: " << savedModel << endl;

return;
}
void createTrainingFile() {
// READ IN IDENTIFIED PEPTIDES
cout << "Learning from: " << inputFile << endl;
ifstream inputHits;
inputHits.open(inputFile); // load the input file

ofstream tempTrainingSet;
tempTrainingSet.open(tempTrainingfile);

ofstream trainingSet; // written to immediately if we're not using fuzzy outputs (as it doesn't need to be balanced)
trainingSet.open(trainingfile);

ofstream testingSet;
testingSet.open(testingfile);

ofstream testTrainLog;
testTrainLog.open("testTrain.log");

vector<vector<double>> pmasses; // records <mass,error> for all parent masses
vector<vector<double>> fragmasses; // records <mass,error> for all fragment masses

vector<string> trainSet; // training set - saved in the MultilayerPerceptron input file
vector<string> testSet; // testing set - saved as a text list of inputs with the correct output as the last figure

string line;
int verbosity = 0;
double loaded = 0; // total
double testing = 0; // those used in testing
while (!inputHits.eof()) {
hit h = loadInHit(inputHits); loaded++;
if (h.seq.compare("END")==0) {break;} // no more available (loadInHit returns 'END' at eof)
if (h.Zscore < minZscore) {continue;} // hit not confident enough to use for learning
bool train; // if set to 1, used for training, else used for testing
if ((testing/loaded) <= propTesting) {train=0; testing++;} else {train=1;} // if not train, then test
cout << "processing " << h.seq << endl;

vector<string> instances = processHit(h,train); // creates attributes for training with this peptide

//for (unsigned i=0; i < instances.size(); i++) {
// cout << instances.at(i);
//}

if (train==0) { // peptide allocated to the testing set - dump to file
for (unsigned i=0; i < instances.size(); i++) {testingSet << instances.at(i);}
testingSet << endl; // peptides separated by newlines (already counted in 'testing' above)
}
if (train==1) { // peptide allocated to the training set - save for later writing
for (unsigned i=0; i < instances.size(); i++) {trainSet.push_back(instances.at(i));}
}
//if (processed > 50) {break;}
//break;
}

//cout << "training set size: " << trainSet.size(); exit(0);

string inputs = "ION_TYPE PCHARGE FCHARGE PLEN PROPLEN PRHKR N_X C_X"; // inputs: iontype, pcharge, fcharge, plen, prop_len, pr_hkr, n_x, c_x
string outputs = "LOG_INTEN_NORM"; // proportion of the total intensity

trainingSetHeader(trainingSet, trainSet.size(), inputNodes, outputNodes, inputs, outputs);
for (unsigned i=0; i < trainSet.size(); i++) {trainingSet << trainSet.at(i);}
return;
}
vector<string> processHit(hit h, bool train) { // set train to 1 if it's going to the training set, else it goes to the testing set
vector<Flood::Vector<double>> matrix; // a vector of Flood-style vectors
for (unsigned i=0; i < h.peaks.size(); i++) { // for each peak
fragmatch fm = h.peaks.at(i);
Flood::Vector<double> attributes = getAttributes(h.peaks.at(i),h.seq,h.pcharge);
if (attributes.getSize()==0) {
cout << "no attributes generated for peak: " << i << " with " << fm.fragType << endl;
continue;
} // no identified matching fragment
//cout << "for hit: " << fm.fragType << " " << fm.fragDesc << " " << " int: " << fm.intensity;
//cout << " attributes="; printFloodVector(attributes);
matrix.push_back(attributes);
}

// PEAK PROCESSING GOES IN HERE
unsigned sumIntensities=0;
for (unsigned i=0; i < h.peaks.size(); i++) {sumIntensities+=h.peaks.at(i).intensity;}

//cout << "sumInt: " << sumIntensities << endl;

vector<double> processedIntensities;
for (unsigned i=0; i < h.peaks.size(); i++) { // for each peak
fragmatch fm = h.peaks.at(i);
if (fm.expmz == 0) {continue;}
unsigned intensity = fm.intensity;
//double prop = ((double) intensity / (double) sumIntensities);
double processed = (double) intensity / (double) sumIntensities;
if (processed>0.2) {processed=0.2;} // cap the proportional intensity at 0.2
processedIntensities.push_back(processed);
}
//exit(0);

//processedIntensities = normalise(processedIntensities,100);

// convert the matrix to strings of attributes with the processed answer
vector<string> outBlock; // each string corresponds to a line in a training or testing set
for (unsigned i=0; i < matrix.size(); i++) {
Flood::Vector<double> in = matrix.at(i);
string out;
for (int j=0; j < in.getSize(); j++) {
out+=stringify(in[j]);
out+=string(" ");
}
out+=stringify(processedIntensities.at(i));
out+="\n";
outBlock.push_back(out);
}
return outBlock; // just a list of all hits and misses (with fuzzy outputs)
}
void trainingSetHeader(ofstream& f, int samples, int inputs, int outputs, string inputNames, string outputNames) {
f << "% Training Dataset for PINNACLE\n";
f << "NumberOfSamples:\n" << samples << endl;
f << "NumberOfInputVariables:\n" << inputs << endl;
f << "NumberOfTargetVariables:\n" << outputs << endl;
f << "NameOfInputVariables:\n" << inputNames << endl;
f << "NameOfTargetVariables:\n" << outputNames << endl;
f << "InputAndTargetData:\n";
return;
}
string convert2str(vector<double> v) { // adds newline at end
string out;
for (unsigned i=0; i < v.size(); i++) {
char buffer[50]; int n;
n = sprintf_s(buffer, 50, "%4.6f", v.at(i));
out += string(buffer);
if (i < v.size()-1) {out += " ";} else {out += "\n";}
}
//cout << "string out: " << out;

return out;
}
vector<double> fuzzyOut(float intensity, float maxpeak) { // maxpeak may also be the sumOfPeaks if normalising against total peak intensities
vector<double> out(outputNodes,0);
double prop = (double) intensity / (double) maxpeak; // the proportional intensity of this peak w.r.t. the maximum in the spectrum
prop *= 1000; // multiplier to push values above one, so the log we take below is positive

// DIRECT OUTPUT - LOG of the proportion of the sum of intensities this fragment has. Gives us a nice normal curve for learning.
if (intensity==0) {out.at(0)=0;}
else {out.at(0) = log(prop); if (out.at(0) < 0) {out.at(0) = 0;}} // don't want negative logs
return out;

// FUZZY OUTPUT (unreachable - disabled by the return above)
//cout << "for proportion: " << prop << endl;
for (unsigned i=0; i < outputNodes; i++) {
double node = i*(1/((double) outputNodes-1)); //cout << "node " << i << " with value " << node << endl; // current node
double diff = fabs(prop-node);
double decay = 3*diff; // decay is zero when bang on - tells the program how much weight to put on neighbouring nodes
double value = 1-decay; if (value < 0) {value=0;} // cout << "diff: " << diff << " decay: " << decay << " value: " << value << endl;
out.at(i) = value;
}
cout << "fuzzyOut: <"; for (unsigned i=0; i < outputNodes; i++) {cout << out.at(i) << ',';} cout << '>' << endl;

return out;
}
fragmatch findMatch(string iontype, unsigned breakSite, hit h) {
// always returns the first such hit, i.e. ignores whether it is a water loss etc.

// basically, scan through the list of matches, and if a hit is found, return it
//cout << "searching for ion type: " << iontype << " with breakSite at " << breakSite << endl;

if (iontype.compare("nofrag")==0) { // parent peptide
for (unsigned i=0; i < h.peaks.size(); i++) {
fragmatch f = h.peaks.at(i); // at(i), not at(0), so every peak is examined
if (!f.fragType.substr(0,6).compare("nofrag")) {return f;}
}
}
if (iontype.compare("immo")==0) { // immonium fragmatches have a default fragSite of 0
// here we just need to find a fragment match with the right description
char aa = h.seq.at(breakSite); // this is the specific immonium ion we're looking for
//cout << "looking for immo ion: " << aa << endl;
for (unsigned i=0; i < h.peaks.size(); i++) {
fragmatch f = h.peaks.at(i);
//cout << "looking at peak: " << f.fragType << " obsmz: " << f.obsmz << endl;
if (f.obsmz == 0) {continue;} // there is no match here
if (f.fragType.at(0) == 'i' && f.fragDesc.at(f.fragDesc.size()-1) == aa) {
//cout << "this peak has been found: " << f.fragType << " intensity: " << f.intensity << endl;
return f;
}
}
}
if (iontype.substr(0,2).compare("y1") == 0) {
for (unsigned i=0; i < h.peaks.size(); i++) {
fragmatch f = h.peaks.at(i);
//cout << "\tlooking at frag " << f.fragDesc << " with breakSite at: " << f.fragSite << endl;
if (f.fragSite == breakSite) {
if (!f.fragType.substr(0,2).compare("y1")) {return f;}
}
}
}
if (iontype.substr(0,2).compare("b1") == 0) {
for (unsigned i=0; i < h.peaks.size(); i++) {
fragmatch f = h.peaks.at(i);
//cout << "\tlooking at frag " << f.fragDesc << " with breakSite at: " << f.fragSite << endl;
if (f.fragSite == breakSite) {
if (!f.fragType.substr(0,2).compare("b1")) {return f;}
}
}
}

// getting to here means we didn't find a matching peak
fragmatch nomatch;
nomatch.obsmz = 0; // this is how we define a missing peak
return nomatch;
}

hit loadInHit(ifstream& file) {
// given the filestream to the output file, loads in a hit. Should be replaced with binary files.
// seq:pcharge:obspmass:exppmass:stdDevs:basicHits:prelimScore:LOD:{mod0:mod1..}:{match1,match2...}:overallLOD\n
// where match = [intensity:obsfmass:expfmass:type:description:LOD]
string line;
hit h;

getline(file, line);
if (file.eof()) {h.seq = "END"; return h;}

//cout << line;
unsigned i=0; unsigned j=0; while (line.at(j) != ':') {j++;}
h.seq = line.substr(i,j-i); // identified sequence
if (verbosity > 3) {cout << "seq loaded: " << h.seq << endl;}


i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.pcharge = convert2int(line.substr(i,j-i));
i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.obspmass = convert2double(line.substr(i,j-i));
i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.exppmass = convert2double(line.substr(i,j-i));
i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.Zscore = convert2int(line.substr(i,j-i));
i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.basichits = convert2int(line.substr(i,j-i));
i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.prelimScore = convert2double(line.substr(i,j-i));
i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.overallLOD = convert2double(line.substr(i,j-i));

i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.mods = convert2mods(line.substr(i,j-i));
i=j+1; j=j+1; while (line.at(j) != '}') {j++;} h.peaks = convert2peaks(line.substr(i,j-i));

//cout << "exiting"; exit(0);
return h;
}

vector<double> convert2mods(string token) {
vector<double> mods;
//cout << "converting to mod vector<double>: " << token << endl;
unsigned i=1; unsigned j=2; // first char is '{'
while (j < token.size()) { // for each mod mass
if (token.at(j) != ',' || j < token.size()-2) {j++;}
string tmp = token.substr(i,j-i-1); //cout << "modmass: " << tmp << endl;
double modmass = convert2double(tmp);
mods.push_back(modmass);
i=j; j++;
}
return mods;
}

vector<fragmatch> convert2peaks(string token) {
vector<fragmatch> matches;
// first break the line into individual matches
vector<string> lines;
unsigned i=0; unsigned j=0;
while (j < token.size() && i < token.size()) {
while (token.at(i) != '[') {i++;}
while (token.at(j) != ']') {j++;}
//cout << "pushing back: " << token.substr(i+1,j-i-1) << endl;
string strfrag = token.substr(i+1,j-i-1); // cout << "parts: " << strfrag << endl;
unsigned k=0;
vector<unsigned> commaPos; // positions of the commas in the string
fragmatch f;
while (k < strfrag.size()) {if (strfrag.at(k) == ',') {commaPos.push_back(k);} k++;}

/*cout << token.at(i) << endl;
cout << "commaPos: "; for (unsigned p=0; p < commaPos.size(); p++) {cout << commaPos.at(p) << ',';} cout << endl;
cout << "intensity: " << strfrag.substr(0,commaPos.at(0)) << endl;
cout << "obsMz: " << strfrag.substr(commaPos.at(0)+1,commaPos.at(1)-commaPos.at(0)-1) << endl;
cout << "expMZ: " << strfrag.substr(commaPos.at(1)+1,commaPos.at(2)-commaPos.at(1)-1) << endl;
cout << "fcharge: " << strfrag.substr(commaPos.at(2)+1,commaPos.at(3)-commaPos.at(2)-1) << endl;
cout << "type: " << strfrag.substr(commaPos.at(3)+1,commaPos.at(4)-commaPos.at(3)-1) << endl;
cout << "desc: " << strfrag.substr(commaPos.at(4)+1,commaPos.at(5)-commaPos.at(4)-1) << endl;
cout << "fragLOD: " << strfrag.substr(commaPos.at(5)+1) << endl;*/

f.intensity = convert2int(strfrag.substr(0,commaPos.at(0)));
f.obsmz = convert2double(strfrag.substr(commaPos.at(0)+1,commaPos.at(1)-commaPos.at(0)-1));
f.expmz = convert2double(strfrag.substr(commaPos.at(1)+1,commaPos.at(2)-commaPos.at(1)-1));
f.fcharge = convert2int(strfrag.substr(commaPos.at(2)+1,commaPos.at(3)-commaPos.at(2)-1));

if (f.expmz == 0.0) {i=j; j++; continue;} // a missing peak isn't of any use at the moment

f.fragType = strfrag.substr(commaPos.at(3)+1,commaPos.at(4)-commaPos.at(3)-1);
f.fragDesc = strfrag.substr(commaPos.at(4)+1,commaPos.at(5)-commaPos.at(4)-1);
f.fragLOD = convert2double(strfrag.substr(commaPos.at(5)+1));

// calculate the fragmentation site. Zero is the whole peptide and we go from N to C term.
if (f.fragType.at(0) == 'i' || f.fragType.at(0) == 'n') {f.fragSite = 0;} // no frag or immonium
else { // x/y/z or a/b/c ion
if (f.fragType.at(0) == 'y') {
unsigned site=1;
while (f.fragDesc.at(site) == '.') {site++;}
f.fragSite = site; //continue;
}
if (f.fragType.at(0) == 'a' || f.fragType.at(0) == 'b') {
unsigned site=1;
while (f.fragDesc.at(site) != '.') {site++;}
f.fragSite = site; //continue;
}
if (f.fragType.at(0) != 'a' && f.fragType.at(0) != 'b' && f.fragType.at(0) != 'y') {
cout << "fragType " << f.fragType << " not yet implemented" << endl;
i=j; j++; continue; // does not record frag site
}
}

matches.push_back(f); // add this peak

i=j; j++; // continue onto next fragment
}
return matches;
}

double convert2double(string str) {
return atof(str.c_str()); // convert to double; c_str() can be passed directly, avoiding the leaked heap copy
}
int convert2int(string str) {
return atoi(str.c_str()); // convert to int
}
double encodeAA(char aa) { // encodes AA letter names into numbers
if (aa == 'X') {return 0.00;} // if there is no AA related here, i.e. N-term side of a b1 ion
if (aa == 'V') {return 0.00;} // hydrophobic side groups
if (aa == 'L') {return 0.05;}
if (aa == 'I') {return 0.10;}
if (aa == 'M') {return 0.15;}
if (aa == 'F') {return 0.20;}
if (aa == 'G') {return 0.25;} // in between
if (aa == 'A') {return 0.30;}
if (aa == 'S') {return 0.35;}
if (aa == 'T') {return 0.40;}
if (aa == 'Y') {return 0.45;}
if (aa == 'W') {return 0.50;}
if (aa == 'C') {return 0.55;}
if (aa == 'P') {return 0.60;}
if (aa == 'N') {return 0.65;} // hydrophilic side groups
if (aa == 'E') {return 0.70;}
if (aa == 'Q') {return 0.75;}
if (aa == 'D') {return 0.80;}
if (aa == 'H') {return 0.85;}
if (aa == 'K') {return 0.90;}
if (aa == 'R') {return 0.95;}
//std::cout << "EncodeAA -> unknown AA letter code: " << aa << std::endl;
return 0.0; // note: unknown codes share 0.0 with 'X' and 'V'
}
double countHKR(string& seq) { // count the number of H, K or R residues in the input sequence
double HKR=0;
for (unsigned i=0; i<seq.size(); i++) {
char aa = seq.at(i);
if (aa == 'H') {HKR++; continue;}
if (aa == 'K') {HKR++; continue;}
if (aa == 'R') {HKR++; continue;}
}
return HKR;
}
double getMean(vector<double>& v) { // returns the mean
double total=0;
for (unsigned i=0; i < v.size(); i++) {total += v.at(i);}
return total/v.size();
}
double getStdDev(vector<double>& v, double mean) { // returns the population standard deviation; second arg is the precomputed vector mean
double numerator=0;
for (unsigned i=0; i < v.size(); i++) {numerator += pow((v.at(i)-mean),2);}
return pow(numerator/v.size(),0.5);
}


/*
EXAMPLE OF AN OUTPUT FILE WHICH MUST BE PROCESSED AND LEARNED:

Search Parameters:
parentTol=1
tol=0.5
minBasicHits=4
minPrelimScore=0.05
randomSamples=25
randomLODspread=20
reportMin=5
topCandidates=0.3
maxTopCandidates=20
newmetric=0
maxFragCharge=2
verbosity=1
devsAboveMeanForHit=2
collectRandomHits=0
MOD0 = iodacetamide (static) mass: 57 acting on: C
MOD1 = oxidation (variable) mass: 16 acting on: M H W
MOD2 = deamidation (variable) mass: 1 acting on: N Q

MOD0 MOD1 MOD2 Delta
0 0 0 0
0 1 0 16
0 2 0 32
1 0 0 57
1 1 0 73
1 2 0 89
2 0 0 114
2 1 0 130
2 2 0 146
loading peptide database: Forward Db: 1689106 unique peptides loaded from peptides.dat
loading peptide database: Reverse Db: 683090 unique peptides loaded from reverse.dat
DTAFiles\AGTEPSDAIR.mgf.dta
top ten candidates for pmass: 1015.49 DTAFiles\AGTEPSDAIR.mgf.dta
AGTEPSDAIR prelim: 0.652732 LOD: 25.9144
ASAEQSGGGPR prelim: 0.498535 LOD: 23.7549
ASAVSELSPR prelim: 0.630028 LOD: 23.7549
TFAHNSALR prelim: 0.559211 LOD: 22.6751
APQFTILAR prelim: 0.651944 LOD: 22.6751
ASAAPPHPLR prelim: 0.507099 LOD: 20.5156
TAEEKVSPR prelim: 0.501239 LOD: 19.4358
EDSVLNIAR prelim: 0.502028 LOD: 19.4358
TTSISPALAR prelim: 0.597014 LOD: 19.4358
TGDPQETLR prelim: 0.523944 LOD: 18.356
random mean: 9.58833 random StdDev: 2.37195
std Devs above mean: 6.88297
IDENTIFIED: AGTEPSDAIR
observed pmass: 1015.49 pcharge: 1
identified pmass: 1015.48
mods: {0,0,0,0,0,0,0,0,0,0,}
basicHits: 8
prelimScore: 0.652732
Top 4N peaks: AGTEPSDAIR
intensity obsMZ expMZ type seq LOD
  2265 86.10 86.08 immo I 1.080
  2049 70.07 70.05 immo P 1.080
  1672 175.13 175.10 y1+ .........R 2.160
  1532 359.26 359.15 b1+ AGTE...... 2.160
  816 658.43 658.26 b1+ AGTEPSD... 2.160
  751 74.07 74.05 immo T 1.080
  746 87.09 0.00 0.000
  576 112.09 0.00 0.000
  460 69.88 70.05 immo P 1.080
  453 102.06 102.04 immo E 1.080
  390 85.84 86.08 immo I 1.080
  390 110.09 0.00 0.000
  385 84.07 0.00 0.000
  382 371.19 0.00 0.000
  378 300.15 0.00 0.000
  370 72.09 72.04 b1+ A......... 2.160
  355 230.15 230.11 b1+ AGT....... 2.160
  348 159.11 0.00 0.000
  322 185.12 0.00 0.000
  321 101.09 101.06 a1+ AG........ 1.080
  318 641.39 641.33 y1+-1NH3 ....PSDAIR 1.080
  273 120.09 0.00 0.000
  264 71.88 72.04 b1+ A......... 2.160
  251 158.11 158.10 y1+-1NH3 .........R 1.080
  242 157.13 0.00 0.000
  229 109.71 0.00 0.000
  210 136.09 0.00 0.000
  209 129.11 129.06 b1+ AG........ 2.160
  200 83.81 0.00 0.000
  200 119.66 0.00 0.000
  199 212.13 212.11 b1+-1H2O AGT....... 1.080
  194 130.11 0.00 0.000
overall LOD score: 25.9144

DTAFiles\ALIPSSQNEVELGELLLSLNYLPSAGR.mgf.dta
loading obs spect: 0
top ten candidates for pmass: 2882.55 DTAFiles\ALIPSSQNEVELGELLLSLNYLPSAGR.mgf.dta
ALIPSSQNEVELGELLLSLNYLPSAGR prelim: 0.252294 LOD: 66.5238
QMNLNPMDSPHSPISPLPPTLSPQPR prelim: 0.10953 LOD: 34.8458
GETLGNIPLLAIGPAICLPGIAAIALARK prelim: 0.0974537 LOD: 26.9263
VSSDPAWAVEWIELPRGLSLSSLGSAR prelim: 0.0921176 LOD: 25.3424
DAAIPAPHAPSAPGCLASVKPGGWKGAAGR prelim: 0.0996068 LOD: 25.3424
MEDTDFVGWALDVLSPNLISTSMLGR prelim: 0.0916495 LOD: 23.7585
MALLLLSLGLSLIAAQEFDPHTVMQR prelim: 0.0923048 LOD: 23.7585
QALTPEQVSFCATHMQQYMDPRGR prelim: 0.0931474 LOD: 23.7585
NTSIRVADFGSATFDHEHHTTIVATR prelim: 0.0745179 LOD: 22.1746
DLLEDKILHSHLVDYFPEFDGPQR prelim: 0.107471 LOD: 22.1746
random mean: 11.9109 random StdDev: 4.05973
std Devs above mean: 13.4523
IDENTIFIED: ALIPSSQNEVELGELLLSLNYLPSAGR
observed pmass: 2882.55 pcharge: 1
identified pmass: 2882.47
mods: {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,}
basicHits: 18
prelimScore: 0.252294
Top 4N peaks: ALIPSSQNEVELGELLLSLNYLPSAGR
intensity obsMZ expMZ type seq LOD
  3138 95.95 0.00 0.000
  307 487.20 487.24 y1+ ......................PSAGR 3.168
  279 86.09 86.08 immo L 1.584
  253 112.09 0.00 0.000
  237 763.35 763.38 y1+ ....................YLPSAGR 3.168
  198 175.09 175.10 y1+ ..........................R 3.168
  174 600.31 600.32 y1+ .....................LPSAGR 3.168
  171 70.05 70.05 immo P 1.584
  155 470.23 470.24 y1+-1NH3 ......................PSAGR 1.584
  151 1716.85 0.00 0.000
  141 1416.73 1416.77 y1+ ..............LLLSLNYLPSAGR 3.168
  141 1715.73 1715.91 y1+ ...........LGELLLSLNYLPSAGR 3.168
  133 232.10 232.12 y1+ .........................GR 3.168
  119 400.13 0.00 0.000
  108 990.44 990.50 y1+ ..................LNYLPSAGR 3.168
  104 185.09 185.12 b1+ AL......................... 3.168
  103 542.20 0.00 0.000
  102 877.33 877.42 y1+ ...................NYLPSAGR 3.168
  98 303.16 303.16 y1+ ........................AGR 3.168
  97 1303.62 1303.69 y1+ ...............LLSLNYLPSAGR 3.168
  96 1943.86 1944.02 y1+ .........VELGELLLSLNYLPSAGR 3.168
  95 471.21 0.00 0.000
  95 871.25 0.00 0.000
  92 1077.48 1077.53 y1+ .................SLNYLPSAGR 3.168
  80 742.28 0.00 0.000
  73 298.18 298.20 b1+ ALI........................ 3.168
  71 643.19 0.00 0.000
  71 860.35 860.42 y1+-1NH3 ...................NYLPSAGR 1.584
  66 644.09 644.37 b1+-1NH3-2H2O ALIPSSQ.................... 1.584
  66 743.40 0.00 0.000
  63 756.27 0.00 0.000
  60 187.09 0.00 0.000
  58 391.19 0.00 0.000
  58 449.25 0.00 0.000
  58 2588.10 0.00 0.000
  57 655.32 0.00 0.000
  57 1602.76 1602.83 y1+ ............GELLLSLNYLPSAGR 3.168
  56 497.16 0.00 0.000
  54 228.08 0.00 0.000
  54 859.82 859.42 y1+-1H2O ...................NYLPSAGR 1.584
  53 288.63 0.00 0.000
  53 289.17 0.00 0.000
  51 243.16 0.00 0.000
  51 320.48 0.00 0.000
  51 409.64 0.00 0.000
  47 2072.96 2073.06 y1+ ........EVELGELLLSLNYLPSAGR 3.168
  46 1843.89 0.00 0.000
  40 2187.00 2187.10 y1+ .......NEVELGELLLSLNYLPSAGR 3.168
  36 2167.81 0.00 0.000
  36 2404.79 0.00 0.000
  1655 96.07 0.00 0.000
  974 96.18 0.00 0.000
  958 95.89 0.00 0.000
overall LOD score: 66.5238

DTAFiles\AQMDSSQLIHCR.mgf.dta
top ten candidates for pmass: 1444.43 DTAFiles\AQMDSSQLIHCR.mgf.dta
FDHNAIMKILDL prelim: 0.272246 LOD: 23.6381
MAEPGHSHHLSAR prelim: 0.381321 LOD: 21.0117
MPHLRLVLPITR prelim: 0.299034 LOD: 19.6984
VMSKSAADIIALAR prelim: 0.295296 LOD: 19.6984
IEHSTFDGLDRR prelim: 0.361541 LOD: 18.3852
MESDRLIISLPR prelim: 0.217111 LOD: 17.072
IDAMHGVMGPYVR prelim: 0.284083 LOD: 17.072
WQLNPGMYQHR prelim: 0.317101 LOD: 17.072
GTHPAIHDILALR prelim: 0.294206 LOD: 17.072
SPSSLLHDPALHR prelim: 0.294206 LOD: 15.7588
random mean: 10.2432 random StdDev: 3.34294
std Devs above mean: 4.00694
IDENTIFIED: FDHNAIMKILDL
observed pmass: 1444.43 pcharge: 1
identified pmass: 1444.66
mods: {0,0,0,0,0,0,16,0,0,0,0,0,}
basicHits: 5
prelimScore: 0.272246
Top 4N peaks: FDHNAIMKILDL
intensity obsMZ expMZ type seq LOD
  2157 1000.35 0.00 0.000
  1812 110.06 110.00 immo H 1.313
  1360 175.08 0.00 0.000
  1032 200.09 0.00 0.000
  922 455.11 0.00 0.000
  830 86.08 86.08 immo I 1.313
  648 101.06 101.09 immo K 1.313
  616 472.15 0.00 0.000
  552 70.06 0.00 0.000
  504 335.09 0.00 0.000
  492 89.06 0.00 0.000
  383 112.08 0.00 0.000
  330 585.21 585.18 b1+ FDHNA....... 2.626
  326 133.02 0.00 0.000
  298 1289.38 0.00 0.000
  284 229.11 0.00 0.000
  279 87.04 87.04 immo N 1.313
  265 913.36 0.00 0.000
  263 826.31 0.00 0.000
  254 104.05 104.04 immo M 1.313
  240 698.29 698.26 b1+ FDHNAI...... 2.626
  240 748.23 748.40 y1+ ......MKILDL 2.626
  239 298.05 0.00 0.000
  236 334.07 0.00 0.000
  230 166.04 0.00 0.000
  228 183.07 0.00 0.000
  223 247.07 247.11 y1+ ..........DL 2.626
  221 120.10 120.07 immo F 1.313
  221 1424.88 0.00 0.000
  213 1325.48 0.00 0.000
  210 329.15 0.00 0.000
  204 731.14 731.40 y1+-1NH3 ......MKILDL 1.313
  203 1354.55 0.00 0.000
  201 316.08 0.00 0.000
  201 1008.42 0.00 0.000
  199 568.22 568.18 b1+-1NH3 FDHNA....... 1.313
  196 639.28 0.00 0.000
  188 366.20 0.00 0.000
  186 84.04 0.00 0.000
  184 203.04 0.00 0.000
  181 331.14 0.00 0.000
  181 855.49 0.00 0.000
  177 217.08 0.00 0.000
  177 791.28 0.00 0.000
  174 318.11 0.00 0.000
  170 216.07 0.00 0.000
  167 861.24 861.48 y1+ .....IMKILDL 2.626
  165 531.15 0.00 0.000
overall LOD score: 23.6381

DTAFiles\ATAAPAGAPPQPQDLEFTK.mgf.dta
top ten candidates for pmass: 1908.96 DTAFiles\ATAAPAGAPPQPQDLEFTK.mgf.dta
ATAAPAGAPPQPQDLEFTK prelim: 0.464155 LOD: 54.2672
KPTNLGTSCLLRHLQR prelim: 0.338948 LOD: 27.1336
APQPQRSASPALPFWTR prelim: 0.201213 LOD: 24.6669
STGWYPQPQINWSDSK prelim: 0.355488 LOD: 24.6669
ALQSAGPFGPYIPGGPAPGR prelim: 0.303074 LOD: 23.4336
QVSAESSSSMNSNTPLVR prelim: 0.205814 LOD: 23.4336
GKQSPHHERPEPEPPR prelim: 0.211954 LOD: 22.2002
ALRPKPTLEKDLEEDR prelim: 0.218912 LOD: 20.9669
IVEVLGIPPAHILDQAPK prelim: 0.343929 LOD: 20.9669
GQELHKPETGCQQELR prelim: 0.197297 LOD: 20.9669
random mean: 12.1361 random StdDev: 4.61053
std Devs above mean: 9.138
IDENTIFIED: ATAAPAGAPPQPQDLEFTK
observed pmass: 1908.96 pcharge: 1
identified pmass: 1908.95
mods: {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,}
basicHits: 15
prelimScore: 0.464155
Top 4N peaks: ATAAPAGAPPQPQDLEFTK
intensity obsMZ expMZ type seq LOD
  6560 1299.53 1299.63 y1+ ........PPQPQDLEFTK 2.467
  5095 63.86 0.00 0.000
  5056 70.06 70.05 immo P 1.233
  2954 977.39 977.47 y1+ ...........PQDLEFTK 2.467
  2364 226.10 226.13 b1+-1H2O ATA................ 1.233
  2318 323.13 0.00 0.000
  1331 297.12 297.17 b1+-1H2O ATAA............... 1.233
  1183 315.12 315.17 b1+ ATAA............... 2.467
  1105 341.14 0.00 0.000
  987 663.24 0.00 0.000
  894 244.08 244.13 b1+ ATA................ 2.467
  879 86.08 86.08 immo L 1.233
  833 1595.70 1595.78 y1+ ....PAGAPPQPQDLEFTK 2.467
  828 101.06 101.06 immo Q 1.233
  701 611.25 611.32 b1+ ATAAPAGA........... 2.467
  688 141.08 0.00 0.000
  668 175.07 0.00 0.000
  661 169.08 0.00 0.000
  650 1427.57 1427.69 y1+ ......GAPPQPQDLEFTK 2.467
  613 269.12 0.00 0.000
  613 548.20 0.00 0.000
  596 933.36 933.48 b1+ ATAAPAGAPPQ........ 2.467
  588 905.35 905.48 a1+ ATAAPAGAPPQ........ 1.233
  572 1281.52 1281.63 y1+-1H2O ........PPQPQDLEFTK 1.233
  512 368.15 0.00 0.000
  495 110.06 0.00 0.000
  480 394.16 394.22 b1+-1H2O ATAAP.............. 1.233
  480 1202.47 1202.58 y1+ .........PQPQDLEFTK 2.467
  454 619.24 619.33 y1+-1H2O ..............LEFTK 1.233
  447 270.13 0.00 0.000
  439 776.31 0.00 0.000
  437 1370.58 1370.67 y1+ .......APPQPQDLEFTK 2.467
  433 98.05 0.00 0.000
  414 365.16 0.00 0.000
  403 72.07 72.04 b1+ A.................. 2.467
  396 337.15 0.00 0.000
  394 173.07 173.09 b1+ AT................. 2.467
  386 243.11 0.00 0.000
  370 126.03 0.00 0.000
  367 708.29 708.37 b1+ ATAAPAGAP.......... 2.467
  360 195.09 0.00 0.000
  350 248.13 248.14 y1+ .................TK 2.467
  330 112.06 0.00 0.000
  330 115.07 0.00 0.000
  329 212.10 0.00 0.000
  313 540.22 540.28 b1+ ATAAPAG............ 2.467
  311 209.06 0.00 0.000
  311 295.13 0.00 0.000
  310 167.09 0.00 0.000
  309 906.38 0.00 0.000
  303 155.07 155.09 b1+-1H2O AT................. 1.233
  299 382.15 0.00 0.000
  293 186.09 0.00 0.000
  278 74.05 74.05 immo T 1.233
  278 1052.46 0.00 0.000
  273 583.21 583.32 a1+ ATAAPAGA........... 1.233
  264 1088.41 1088.53 y1+-1NH3 ..........QPQDLEFTK 1.233
  262 355.12 0.00 0.000
  259 102.06 102.04 immo E 1.233
  245 338.15 0.00 0.000
  242 453.19 0.00 0.000
  239 520.24 0.00 0.000
  237 452.16 0.00 0.000
  232 198.10 0.00 0.000
overall LOD score: 54.2672

*/


Flood::Vector<double> getAttributes(fragmatch& f, string seq, unsigned pcharge) {

//if (params.verbosity < 5) {cout << "getting attributes for frag " << f.type << " " << f.desc << endl; }
// given the seq and the matched fragment, returns the attribute vector to be fed into the neural net
vector<double> attributes; // holds the attributes for this fragment


// NO FRAGMENTATION - TYPE 0
if (f.fragType.substr(0,6).compare("nofrag")==0) {
attributes.push_back(0); // iontype 0 = unfragmented
attributes.push_back(pcharge);
attributes.push_back(f.fcharge);
// ... (the remainder of this function is truncated in the posted snippet)
