Posted By

leomchugh on 11/01/10


Tagged

learning Pinnacle Machine Harvest


Versions (?)

PINNACLE - Automated Machine Learning Module for the Harvest peptide identification package


 / Published in: C++
 

Harvest paper: Citation: McHugh L, Arthur JW (2008) Computational methods for protein identification from mass spectrometry data. PLoSComputBiol 4(2): e12. doi:10.1371/journal.pcbi.0040012

PINNACLE automatically learns the statistical and neural network models for the Harvest algorithm.

Contact [email protected] for a Visual Studio project solution.

  1. /* Learner.cpp :
  2. Copyright (c) 2009, Leo McHugh, University of Sydney, Australia
  3. All rights reserved.
  4.  
  5. Redistribution and use in source and binary forms, with or without
  6. modification, are permitted provided that the following conditions are met:
  7.   * Redistributions of source code must retain the above copyright
  8.   notice, this list of conditions and the following disclaimer.
  9.   * Redistributions in binary form must reproduce the above copyright
  10.   notice, this list of conditions and the following disclaimer in the
  11.   documentation and/or other materials provided with the distribution.
  12.   * Neither the name of the <organization> nor the
  13.   names of its contributors may be used to endorse or promote products
  14.   derived from this software without specific prior written permission.
  15.  
  16. THIS SOFTWARE IS PROVIDED BY LEO MCHUGH ''AS IS'' AND ANY
  17. EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
  18. WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
  19. DISCLAIMED. IN NO EVENT SHALL LEO MCHUGH BE LIABLE FOR ANY
  20. DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
  21. (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
  22. LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
  23. ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  24. (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
  25. SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  26.  
  27. This code requires input files from PreProcess.cpp
  28.  
  29. updates of this code can be found at: http://www.usyd.edu.au/sydneybioinformatics/research/software.shtml
  30.  
  31. Contact: [email protected]
  32.  
  33. */
  34.  
  35. /*
  36. PURPOSE AND USE OF THIS SOFTWARE:
  37.  
  38. This software is designed for use with PINNACLE. Once PINNACLE is run, it generates a list of hits along with a
  39. confidence score for each, and prints out a list of identified fragments. An example is shown as a commented out
  40. section at the very end of this file.
  41. Learner.cpp uses these files to gather statistical information about fragmentation and to build and train a
  42. neural network.
  43.  
  44. For version1.0, only limited information will be collected.
  45.  
  46. We calculate approximate probability functions for:
  47. the parent mass error w.r.t mass
  48. fragment mass erro w.r.t mass
  49. probability of fragment identification in correct assignments w.r.t mass / charge
  50.  
  51. Neural Nets:
  52.  
  53. Inputs for predicting peak intensity for a fragment:
  54. iontype either no frag, immoium, y or b ion (nofrag, immo, y1, a1, b1) = (0,1,3,2,4); // 5 reserved for z ions but not implemented
  55. pcharge parent charge
  56. fcharge fragment charge
  57. plen the length of the parent peptide
  58. prop_len the proportion of the distance from N to C term that the fragmentation site was at
  59. pr_hkr proportion of HKR and R residues in the fragment over those in the unbroken peptide
  60. n_x identity of AA on the n-term side of cleavage
  61. c_x identity..... c-term
  62.  
  63. inputs: iontype, pcharge, fcharge, plen, prop_len, pr_hkr, n_x, c_x
  64.  
  65. (taken from working notes07)
  66.  
  67. ==== Add later============
  68. loss 0 if it's a 'native' ion, such as a y or b, 1 if it contains losses.
  69.  
  70. Output:
  71. log of the intensity, scaled so that the max log(intensity) for the fragment is set to 100 // maybe use later 6 Fuzzy Logic NN outputs, 0, 20, 40 etc to 100.
  72.  
  73.  
  74. */
  75.  
  76. #include "stdafx.h"
  77. #include <iostream>
  78. #include <string>
  79. #include <fstream>
  80. #include <iostream>
  81. #include <sstream>
  82. #include <vector>
  83. #include <list>
  84. #include <time.h> // for random number generation
  85.  
  86. // Learner based libraries:
  87. #include "../Utilities/Vector.h"
  88. #include "../Utilities/Matrix.h"
  89. #include "../Utilities/InputTargetDataSet.h"
  90. #include "../MultilayerPerceptron/MultilayerPerceptron.h"
  91. #include "../ObjectiveFunctional/MeanSquaredError.h"
  92. #include "../TrainingAlgorithm/QuasiNewtonMethod.h"
  93.  
  94.  
  95. using namespace std;
  96.  
  97. // ======================= FIXED PARAMETERS =====================
  98. unsigned const MAX_SEQ_SIZE = 50; // make sure this is the same as in preprocess.cpp and PINNACLE.cpp
  99. unsigned const MIN_SEQ_SIZE = 5; // make sure this is the same as in preprocess.cpp and PINNACLE.cpp
  100. unsigned outputNodes = 1; // The number of output nodes in the fuzzy network
  101. unsigned inputNodes = 8; // described above
  102. double propTesting = 0.2; // Proportion of peptides reserved for testing.
  103. double minZscore = 3.7; // minimum required Zscore for a 'hit' to be considered confident enough to be used for testing/training
  104. unsigned const buckets = 50; // mass spectrum division into this many buckets of 100 da. for statistical learning part
  105. bool fuzzyOutputs=false; // if true, splits the output to a fuzzy output, if false, there is a single output node.
  106.  
  107. // ======================= OBJECTS ==============================
  108. // ============== MUST BE THE SAME AS IN PINNACLE.CPP ===========
  109. struct fragmatch { // records a matching fragment in from an indentified peptide
  110. unsigned intensity;
  111. double obsmz;
  112. double expmz; // expected mz
  113. int fcharge; // fragment charge
  114. string fragType; // ie y1+
  115. string fragDesc; // decription of the fragment. ie: ........NQLLQFR
  116. unsigned fragSite; // from N to c term. zero is no frag, 1 is first bion / last y ion etc,
  117. double fragLOD; // LOD for this frag
  118. };
  119. struct format // converts a hit into a neural net input set
  120. {
  121. double pmasserror; // error in the pmass calculation
  122. vector<vector<double>> fmzerror; // fragment mass error array <fmass,error>
  123. vector<vector<double>> NNlines; // formatted lines to be appended to the NN training matrix.
  124. };
  125.  
  126. struct hit {
  127. string seq; // sequence of the matched peptide
  128. double Zscore; // number of std devs above the random mean
  129. int pcharge; // charge of the observed parent peptide
  130. int basichits; // the number of no loss y and b ions
  131. double prelimScore; // the proportion of the top 4N peaks covered by no loss y and b ions.
  132. double obspmass; // observed pmass of the input spectrum (uncharged)
  133. double exppmass; // theoretical pmass of the matched peptide (uncharged)
  134. vector<fragmatch> peaks; // contains the top 4N peaks
  135. vector<double> mods; // one entry for each AA from N to C term
  136. double overallLOD; // LOD score generated by PINNACLE for this peptide match
  137. };
  138. struct learnedParameters { // holds the parameters learned by this file for passing to PINNACLE
  139. // THIS STRUCTURE PASSES ANY LEARNED PARAMETERS TO PINNACLE
  140. // IF YOU WANT TO INCLUDE MORE STATISTICAL LEARNING, ADD THE LEARNED PARAMETERS
  141. // TO THIS OBJECT, AND UPDATE PINNACLE SO THAT IT CAN LOAD THE SAME OBJECT
  142. // THIS OBJECT IS WRITTEN TO BINARY FILE AFTER LEARNING SO THAT PINNACLE CAN READ IT IN
  143. double pmassErrors[buckets][2]; // mean and std dev for every bucket (100 Da) for parent mass errors
  144. double fmassErrors[buckets][2]; // mean and std dev for every bucket (100 Da) for fragment mass errors
  145.  
  146. // FEEL FREE TO ADD MORE LEARNED PARAMETERS IN HERE (REMEMBERING TO UPDATE THE SAME OBJECT IN PINNACLE
  147.  
  148. };
  149. // =================================== FILE LOCATIONS ================================================================
  150. // location of the output of the PINNACLE run (this is a serial file writen by PINNACLE storing all the hits as objects)
  151. char *inputFile = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\PINNACLE\\Release\\output.ser"; // .ser (serial) - see PINNACLE for format details
  152.  
  153. // the below files are all outputs from this program. savedModel and learnedParameters store the neural network and
  154. // other general learned parameters respectively and these 2 files must be written to the working directory of PINNACLE
  155.  
  156. // neural network related
  157. char *tempTrainingfile = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\Learner\\Debug\\tempTrainingSet.dat"; // just contains instances
  158. char *trainingfile = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\Learner\\Debug\\trainingSet.dat";
  159. char *testingfile = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\Learner\\Debug\\testingSet.dat";
  160. char *trainingHistory = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\Learner\\Debug\\trainingHistory.dat";
  161. char *savedModel = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\PINNACLE\\Debug\\multilayerPerceptron.dat";
  162.  
  163. // other learned parameters (see object for details)
  164. char *learnedParams = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\PINNACLE\\Release\\learnedParameters.bin"; // binary
  165.  
  166. // ===================================== FUNCTIONS ===========================
  167. // LOAD IN AND DATA PREPARATION
  168. void createTrainingFile();
  169. hit loadInHit(ifstream&); // loads in the next hit
  170. vector<string> processHit(hit, bool); // returns a block of lines which are converted into a NN input file (Flood style)
  171. double getMean(vector<double>&); // returns the mean
  172. double getStdDev(vector<double>&, double); // returns the standard deviation of input array, second arg is the precomputed vector mean.
  173. void learnMassErrors(); // learns parent and fragment mass errors for this dataset
  174.  
  175. // TESTING OF THE MODEL
  176. void loadAndProcessTrainingFile();
  177. void testModel(); // compares against a flat spectrum.
  178. double dotProduct(vector<double>, vector<double>); // return inner dot product of input vectors
  179. double testPeptide(vector<double>); // returns the dot product of a flat and the predicted spect.
  180. fragmatch findMatch(string, unsigned, hit); // (iontype, breakSite,matchArray)
  181. vector<double> fuzzyOut(float, float); // produces a fuzzy output of the intensity
  182. void trainingSetHeader(ofstream&,int,int,int,string,string);// sets up the header for the output training file
  183.  
  184. // OTHER HANDY FUNCTIONS
  185. vector<double> normalise(vector<double>, double sumTotal); // takes an input vector and normalises to sum to 100
  186. vector<double> convert2mods(string); //
  187. vector<double> convert2doubleVect(string); // converts a space separated line into a vector of doubles (from test set)
  188. vector<fragmatch> convert2peaks(string); // used during loading in to convert .ser file to a hit object
  189. double convert2double(string); // converts a string to a double
  190. int convert2int(string); // converts a string to an int
  191. string convert2str(vector<double>); // converts doubles into a string for output
  192. double encodeAA(char);
  193. double countHKR(string&);
  194. string stringify(double x); // convert double to string
  195.  
  196. // Neural network
  197. Flood::Vector<double> getAttributes(fragmatch& fm,string seq, unsigned pcharge);
  198. Flood::MultilayerPerceptron ANN;
  199. void printFloodVector(Flood::Vector<double>& v);
  200.  
  201.  
  202. // ============== DEBUGGING ===========================
  203. int verbosity = 4;
  204.  
  205. int _tmain(int argc, _TCHAR* argv[])
  206. {
  207. // NEURAL NETWORK LEARNING
  208. // reads the identified peptides and creates a NN training file
  209. // createTrainingFile();
  210.  
  211. // imports the processed training file, sets the neural net parameters for the model and learning
  212. // loadAndProcessTrainingFile();
  213.  
  214. // tests neural network model against the testing file generated by createTrainingFile.
  215. testModel();
  216.  
  217. // STATISTICAL MACHINE LEARNING
  218. // collects basic stats on hits to optimise search parameters.
  219. // learnMassErrors(); // collects stats on parent and fragment mass erros.
  220.  
  221.  
  222. return 0;
  223. }
  224. void learnMassErrors() { // learn the mass errors from the hits
  225. cout << "Learning from parent and fragment mass errors from: " << inputFile << endl;
  226. ifstream inputHits;
  227. inputHits.open(inputFile); // load in the the input file in binary format
  228.  
  229. learnedParameters lp; // object for storing learned parameters
  230.  
  231. // we divide the mass range up into buckets of 100 Da, storing a mean and variance for each bucket
  232. // PINNACLE will then use this information in a prob density function
  233. vector<vector<double>> pmassbuckets; // stores the parent mass errors for each bin
  234. vector<vector<double>> fmassbuckets; // stores the fragment mass errors for each bin
  235. for (unsigned i=0; i < buckets; i++) {
  236. vector<double> v;
  237. pmassbuckets.push_back(v); // create elements (blank memory spacers for the time being)
  238. fmassbuckets.push_back(v); // create elements
  239. }
  240.  
  241. unsigned hitsLoaded=0;
  242. while (!inputHits.eof()) {
  243. hit h = loadInHit(inputHits);
  244. if (h.seq.compare("END")==0) {cout << "hits loaded: " << hitsLoaded << endl; break;} // no more available (loadInHit returns 'END' at eof)
  245. if (h.Zscore < minZscore) {continue;} // not confident enough hit to use for learning
  246. hitsLoaded++;
  247.  
  248. // RECORD PARENT MASS ERROR
  249. unsigned index = (unsigned) floor(h.obspmass/100); // assign bucket (100 Da per bucket)
  250. if (index > buckets-1) {index = buckets-1;} // prevent overflow
  251. //double modmass =0; // account for mods
  252. //for (unsigned i=0; i < h.mods.size(); i++) {modmass+=h.mods.at(i);}
  253. double difference = h.obspmass-(h.exppmass);
  254. if (fabs(difference) > 1) {
  255. cout << "obs mass: " << h.obspmass << " expmass: " << h.exppmass << endl; // " modmass: " << modmass << endl;
  256. cout << "unexpectedly high parent mass error - skip"; continue; // sometimes a peptide is IDed twice, each with different mods, keep searching..
  257. }
  258. pmassbuckets.at(index).push_back(difference); // record parent mass error
  259.  
  260. // RECORD FRAGMENT MASS ERRORS
  261. for (unsigned i=0; i<h.peaks.size(); i++) {
  262. if (h.peaks.at(i).obsmz == 0) {continue;} // no matching peak
  263. unsigned index = (unsigned) floor(h.peaks.at(i).obsmz/100); // assign bucket (100 Da per bucket)
  264. if (index > buckets-1) {index = buckets-1;} // prevent overflow
  265. // using absolute error values for frag mass due to found bimodal
  266. // distributions for some datasets. non absolute values would be
  267. // better (but can't then use mean and std dev in probability functions
  268. //cout << " fmass: " << h.peaks.at(i).obsmz << " error: " << h.peaks.at(i).obsmz-h.peaks.at(i).expmz << endl;
  269. fmassbuckets.at(index).push_back(fabs(h.peaks.at(i).obsmz-h.peaks.at(i).expmz)); // record parent mass error
  270. //if (fragError.at(1) < 0 && h.peaks.at(i).obsmz > 700) {cout << "UNDER: " << h.peaks.at(i).fragType << " " << h.peaks.at(i).fragDesc << ' ' << h.peaks.at(i).obsmz-h.peaks.at(i).expmz << endl;}
  271. //if (fragError.at(1) > 0 && h.peaks.at(i).obsmz > 700) {cout << " OVER: " << h.peaks.at(i).fragType << " " << h.peaks.at(i).fragDesc << ' ' << h.peaks.at(i).obsmz-h.peaks.at(i).expmz << endl;}
  272. }
  273. } // end loading in all hits from file
  274. inputHits.close();
  275.  
  276. // SHOW SOME OF THE DISTROS IN THE BUCKETS
  277. unsigned testBucket = 8;
  278. cout << "bucket" << testBucket << "={";
  279. for (unsigned i=0; i < fmassbuckets.at(testBucket).size(); i++) {cout << fmassbuckets.at(testBucket).at(i) << ',';}
  280. cout << "};\n";
  281.  
  282. exit(0);
  283.  
  284. // GET PARAMETERS FOR MEAN AND VARIANCE FOR EACH BUCKET
  285. unsigned minDatapoints = 10; // the minimum number of values in a bucket for the calculation of mean and std dev
  286.  
  287. // ================ PARENT MASS ERRORS ============================================================
  288. vector<vector<double>> pe; // pe = pmass errors for each bucket, stores <mean, stdDev>
  289. for (unsigned bucket=0; bucket < buckets; bucket++) {
  290. vector<double> v;
  291. v.clear();
  292. if (pmassbuckets.at(bucket).size() < minDatapoints) {pe.push_back(v); continue;} // too few values to calculate
  293. double mean = getMean(pmassbuckets.at(bucket));
  294. double stdDev = getStdDev(pmassbuckets.at(bucket), mean);
  295. //cout << "massbucket: " << bucket << ": {";
  296. //for (unsigned j=0; j<pmassbuckets.at(bucket).size(); j++) {cout << pmassbuckets.at(bucket).at(j) << ',';} cout << "}" << endl;
  297. //cout << "bucket: " << bucket << " mean: " << mean << " stdDev: " << stdDev << endl;
  298. v.push_back(mean); v.push_back(stdDev);
  299. pe.push_back(v);
  300. }
  301.  
  302. // if there are not enough values in some buckets, they will be empty. In this case we just give them the values of their neighbours.
  303. // first back-propogate the values to leading empty buckets
  304. for (unsigned i=0; i < pe.size(); i++) { // pe = parent errors
  305. if (pe.at(i).empty()) {continue;}
  306. vector<double> v = pe.at(i); // first non-empty bucket
  307. while (1) {pe.at(i) = v; if (i==0) {break;} i--;}
  308. break;
  309. }
  310.  
  311. // now propagate forward any values to any empty buckets
  312. vector<double> v = pe.at(0); // first bucket's value (mean,stdDev)
  313. for (unsigned i=0; i < pe.size(); i++) { // pe = parent errors
  314. if (pe.at(i).empty()) {pe.at(i)=v; continue;} // assign same values as bucket to the left
  315. v = pe.at(i); // update
  316. }
  317.  
  318. cout << "pmass errors: \n"; for (unsigned i=0; i < pe.size(); i++) {
  319. if (pe.at(i).empty()) {cout << "i: " << i << "{}\n";}
  320. else {cout << "bucket: " << i << " {" << pe.at(i).at(0) << ',' << pe.at(i).at(1) << "}\n";}
  321. } cout << endl;
  322.  
  323. // load into object
  324. for (unsigned i=0; i < buckets; i++) { // pe = parent errors
  325. lp.pmassErrors[i][0] = pe.at(i).at(0); // mean
  326. lp.pmassErrors[i][1] = pe.at(i).at(1); // standard Deviation
  327. }
  328.  
  329.  
  330. // ================ FRAGMENT MASS ERRORS ============================================================
  331. vector<vector<double>> fe; // fe = fragment errors for each bucket, stores <mean, stdDev>
  332. for (unsigned bucket=0; bucket < buckets; bucket++) {
  333. vector<double> v;
  334. v.clear();
  335. if (fmassbuckets.at(bucket).size() < minDatapoints) {fe.push_back(v); continue;} // too few values to calculate
  336. double mean = getMean(fmassbuckets.at(bucket));
  337. double stdDev = getStdDev(fmassbuckets.at(bucket), mean);
  338. //cout << "fmassbucket: " << bucket << ": {";
  339. //for (unsigned j=0; j<fmassbuckets.at(bucket).size(); j++) {cout << fmassbuckets.at(bucket).at(j) << ',';} cout << "}" << endl;
  340. //cout << "bucket: " << bucket << " mean: " << mean << " stdDev: " << stdDev << endl;
  341. v.push_back(mean); v.push_back(stdDev);
  342. fe.push_back(v);
  343. }
  344.  
  345. // if there are not enough values in some buckets, they will be empty. In this case we just give them the values of their neighbours.
  346. // first back-propogate the values to leading empty buckets
  347. for (unsigned i=0; i < fe.size(); i++) { // pe = parent errors
  348. if (fe.at(i).empty()) {continue;}
  349. vector<double> v = fe.at(i); // first non-empty bucket
  350. //cout << "first non-empty fragmass bucket: {" << fe.at(i).at(0) << ',' << fe.at(i).at(1) << "}" << endl;
  351. while (1) {fe.at(i) = v; if (i==0) {break;} i--;}
  352. break;
  353. }
  354.  
  355. // now propagate forward any values to any empty buckets
  356. v = fe.at(0); // first bucket's value (mean,stdDev)
  357. for (unsigned i=0; i < fe.size(); i++) { // pe = parent errors
  358. if (fe.at(i).empty()) {fe.at(i)=v; continue;} // assign same values as bucket to the left
  359. v = fe.at(i); // update
  360. }
  361.  
  362. cout << "fragment errors: \n"; for (unsigned i=0; i < fe.size(); i++) {
  363. if (fe.at(i).empty()) {cout << "i: " << i << "{}\n";}
  364. else {cout << "bucket: " << i << " {" << fe.at(i).at(0) << ',' << fe.at(i).at(1) << "}\n";}
  365. } cout << endl;
  366.  
  367. // output for Mathematica graphing - buckets of 100 da {{fragmass,AvgError,StdDev},{...}}
  368. cout << "ferrors={"; for (unsigned i=0; i < fe.size(); i++) {
  369. if (fe.at(i).empty()) {cout << "i: " << i << "{}\n";}
  370. else {cout << '{' << i*100 << ',' << fe.at(i).at(0) << ',' << fe.at(i).at(1) << "},";}
  371. } cout << '}' << endl;
  372.  
  373.  
  374. // load into object
  375. for (unsigned i=0; i < buckets; i++) { // pe = parent errors
  376. lp.fmassErrors[i][0] = fe.at(i).at(0); // mean
  377. lp.fmassErrors[i][1] = fe.at(i).at(1); // standard Deviation
  378. }
  379.  
  380.  
  381.  
  382. ofstream out; // write object to file
  383. out.open(learnedParams, ios::binary);
  384. out.write((char *)(&lp),sizeof(lp)); // write object out to binary file (used by PINNACLE)
  385. out.close();
  386.  
  387. cout << "quiting after learned parameters writen to file";
  388. exit(0);
  389. }
  390. void testModel() {
  391. // first load in the model
  392. Flood::MultilayerPerceptron ANN; ANN.load(savedModel);
  393. cout << "loading ANN from " << savedModel << endl;
  394.  
  395. ifstream testingSet;
  396. testingSet.open(testingfile);
  397. vector<double> improvements;
  398.  
  399. // load in a single instance from the test set
  400. string line;
  401.  
  402. while (!testingSet.eof()) {
  403. vector<vector<double>> instance; // instance of the matching hits for a given peptide
  404. instance.clear();
  405. line = "NULL";
  406. while (!line.empty() && !testingSet.eof()) {
  407. getline(testingSet, line );
  408. if (!line.empty()) {
  409. //cout << " pushing back line: " << line << " of size: " << convert2doubleVect(line).size() << endl;
  410. instance.push_back(convert2doubleVect(line));
  411. }
  412. }
  413. //cout << "instance loaded with " << instance.size() << endl;
  414.  
  415. vector<double> predicted;
  416. vector<double> flat(instance.size(),1);
  417. vector<double> observed;
  418.  
  419. for (unsigned i=0; i < instance.size(); i++) {
  420. vector<double> input = instance.at(i);
  421. double obsPeakInt = input.back(); // the correct output
  422. observed.push_back(obsPeakInt);
  423. input.pop_back(); // remove the last element (answer)
  424. // need to convert a normal vector to a flood style vector
  425. Flood::Vector<double> attributes(input.size());
  426. for (unsigned j=0; j<input.size(); j++) {attributes[j] = input.at(j);}
  427.  
  428.  
  429. Flood::Vector<double> fuzzyOutput = ANN.getOutput(attributes);
  430.  
  431. if (fuzzyOutputs) {
  432. // take the highest node as the prediction of intensity (rounded to nearest 20%)
  433. unsigned maxnode=0; double maxval=0;
  434. for (int k=0; k < fuzzyOutput.getSize(); k++) {
  435. if (fuzzyOutput[k] > maxval) {maxnode=k; maxval=fuzzyOutput[k];}
  436. }
  437. predicted.push_back(maxnode); // just take the node with the highest output
  438. }
  439. else {
  440. //cout << "attributes: "; for (int k=0; k < attributes.getSize(); k++) {cout << attributes[k] << ' ';}
  441. //cout << "predicted output: " << fuzzyOutput[0] << endl;
  442. predicted.push_back(fuzzyOutput[0]); // just the raw single number coming out.
  443. //double exponent = pow(2.17,fuzzyOutput[0]);
  444. //predicted.push_back(exponent); // blow it back up
  445. }
  446.  
  447.  
  448. // print out the fuzzy nodes, max node and observed intensity
  449. /*cout << "fuzzyOut " << i << ": ";
  450. for (int k=0; k < fuzzyOutput.getSize(); k++) {
  451. char buffer [50];
  452. int n, a=5, b=3;
  453. n=sprintf_s(buffer, 50, "%0.3f",fuzzyOutput[k]);
  454. cout << buffer << " ";
  455. }cout << "| " << maxnode*0.2 << " | " << obsPeakInt << endl;
  456. */
  457. } // end prediction of each peak height
  458.  
  459. // end of single test peptide
  460. // Normalise vectors
  461. predicted = normalise(predicted,100); // normalises spectrum to a sum total of 100
  462. flat = normalise(flat,100);
  463. observed = normalise(observed,100);
  464.  
  465. // show normalized vectors
  466. cout << "flat : {"; for (unsigned i=0; i < flat.size()-1; i++) {cout << flat.at(i) << ',';} cout << flat.back() << "}\n";
  467. cout << "predicted: {"; for (unsigned i=0; i < predicted.size()-1; i++) {cout << predicted.at(i) << ',';} cout << predicted.back() << "}\n";
  468. cout << "observed : {"; for (unsigned i=0; i < observed.size()-1; i++) {cout << observed.at(i) << ',';} cout << observed.back() << "}\n";
  469.  
  470. double flatDP = dotProduct(observed,flat);
  471. double annDP = dotProduct(observed,predicted);
  472. improvements.push_back(annDP/flatDP);
  473.  
  474. cout << "imp: " << annDP/flatDP << " flatDP: " << flatDP << " annDP: " << annDP << endl << endl;
  475.  
  476. //exit(0);
  477.  
  478. }
  479.  
  480.  
  481. cout << "improvements={"; for (unsigned i=0; i < improvements.size(); i++) {cout << improvements.at(i) << ',';} cout << "};";
  482.  
  483. // use a dot product model.
  484. // for the FOUND peaks, create a spectrum. then normalise
  485.  
  486.  
  487.  
  488. return;
  489. }
  490. double dotProduct(vector<double> v1, vector<double> v2) {
  491. double dp=0;
  492. if (v1.size() != v2.size()) {cout << "vectors not equal length!"; exit(1);}
  493. for (unsigned i=0; i < v1.size(); i++) {
  494. dp += v1.at(i)*v2.at(i);
  495. }
  496. return dp;
  497. }
  498.  
  499. vector<double> normalise(vector<double> v, double sumTotal) {
  500. double sum = 0;
  501. for (unsigned i=0; i < v.size(); i++) {sum+=v.at(i);}
  502. // make sum to 100
  503. double scalar = sumTotal/sum;
  504. for (unsigned i=0; i < v.size(); i++) {v.at(i)*=scalar;}
  505. return v;
  506.  
  507. }
  508. double testPeptide(vector<double> fragmatches) { // returns a dot product for the predicted and a flat spect.
  509.  
  510. return 0;
  511. }
  512. vector<double> convert2doubleVect (string line) { // converts a space separate line into a double vector
  513. unsigned i=0; unsigned j=0;
  514. vector<double> out;
  515. while (j < line.size()) {
  516. if (j == line.size()-1) { // last one
  517. string tmp = line.substr(i,j-i+1); //cout << "element: " << tmp << endl;
  518. double val = convert2double(tmp);
  519. out.push_back(val);
  520. }
  521. if (line.at(j) != ' ' && j < line.size()) {j++; continue; }
  522. string tmp = line.substr(i,j-i); //cout << "element: " << tmp << endl;
  523. double val = convert2double(tmp);
  524. out.push_back(val);
  525. i=j; j++;
  526. }
  527. //exit(0);
  528. return out;
  529.  
  530. }
  531. void loadAndProcessTrainingFile() {
  532.  
  533. // LOAD IN TRAINING SET INTO NN FOR LEARNING>
  534. Flood::InputTargetDataSet inputTargetDataSet ;
  535. inputTargetDataSet.load(trainingfile);
  536.  
  537. // set up a new neural network
  538. Flood::MultilayerPerceptron multilayerPerceptron(inputNodes,inputNodes,outputNodes); // input, middle layer, output - same number in middle layer as inputs
  539. multilayerPerceptron.initFreeParametersNormal(0.0,1.0); // initialise the network within the chosen range
  540.  
  541. // set up the names of the variables (both in and out)
  542. Flood::Vector<string> nameOfInputVariables = inputTargetDataSet.getNameOfInputVariables();
  543. Flood::Vector<string> nameOfTargetVariables = inputTargetDataSet.getNameOfTargetVariables();
  544. multilayerPerceptron.setNameOfInputVariables(nameOfInputVariables);
  545. multilayerPerceptron.setNameOfOutputVariables(nameOfTargetVariables);
  546.  
  547.  
  548.  
  549.  
  550. // get the parameters for the input data - bound in necessary
  551. //Flood::Matrix<double> meanAndStandardDeviationOfInputData = inputTargetDataSet.getMeanAndStandardDeviationOfInputData();
  552. Flood::Matrix<double> minimumAndMaximumOfInputData = inputTargetDataSet.getMinimumAndMaximumOfInputData();
  553.  
  554. Flood::Vector<double> minOfInputVariables(inputNodes); // these values are not inclusive, so min must be lower than any expected input
  555. minOfInputVariables[0] = -0.01; // iontype
  556. minOfInputVariables[1] = 0.99; // pcharge
  557. minOfInputVariables[2] = 0.99; // fcharge
  558. minOfInputVariables[3] = MIN_SEQ_SIZE; // plen
  559. minOfInputVariables[4] = -0.01; // prop_len
  560. minOfInputVariables[5] = -0.01; // pr_hkr
  561. minOfInputVariables[6] = -0.01; // n_x
  562. minOfInputVariables[7] = -0.01; // c_x
  563. multilayerPerceptron.setMinimumOfInputVariables(minOfInputVariables);
  564.  
  565. Flood::Vector<double> maxOfInputVariables(inputNodes);
  566. maxOfInputVariables[0] = 4.01; // iontype
  567. maxOfInputVariables[1] = 3.01; // pcharge
  568. maxOfInputVariables[2] = 2.01; // fcharge
  569. maxOfInputVariables[3] = MAX_SEQ_SIZE; // plen
  570. maxOfInputVariables[4] = 1.01; // prop_len
  571. maxOfInputVariables[5] = 1.01; // pr_hkr
  572. maxOfInputVariables[6] = 1.01; // n_x
  573. maxOfInputVariables[7] = 1.01; // c_x
  574. multilayerPerceptron.setMaximumOfInputVariables(maxOfInputVariables);
  575.  
  576.  
  577. // set up the relevant max and min etc for the output neuron(s)
  578. Flood::Matrix<double> minimumAndMaximumOfTargetData = inputTargetDataSet.getMinimumAndMaximumOfTargetData();
  579. cout << "min encountered output: " << minimumAndMaximumOfTargetData[0][0] << endl;
  580. cout << "max encountered output: " << minimumAndMaximumOfTargetData[1][0] << endl;
  581. double minBuffer = minimumAndMaximumOfTargetData[0][0] - minimumAndMaximumOfTargetData[0][0]/10; // min plus some buffer for unexpectedly low results
  582. double maxBuffer = minimumAndMaximumOfTargetData[1][0] + minimumAndMaximumOfTargetData[1][0]/10;
  583. cout << "minBuffer: " << minBuffer << " maxBuffer: " << maxBuffer <<endl;
  584.  
  585. Flood::Vector<double> minOfOutputVariables(outputNodes);
  586. minOfOutputVariables[0] = minBuffer; // 10% buffer under the min value encountered in training set
  587. Flood::Vector<double> maxOfOutputVariables(outputNodes);
  588. maxOfOutputVariables[0] = maxBuffer; // 10% buffer over the max value encountered in training set
  589. multilayerPerceptron.setMinimumOfOutputVariables(minOfOutputVariables);
  590. multilayerPerceptron.setMaximumOfOutputVariables(maxOfOutputVariables);
  591.  
  592. multilayerPerceptron.setPreAndPostProcessingMethod(Flood::MultilayerPerceptron::MinimumAndMaximum);
  593. multilayerPerceptron.print();
  594.  
  595. Flood::MeanSquaredError meanSquaredError(&multilayerPerceptron,&inputTargetDataSet);
  596. //multilayerPerceptron.setPreAndPostProcessingMethod(Flood::MultilayerPerceptron::MeanAndStandardDeviation);
  597. Flood::QuasiNewtonMethod quasiNewtonMethod(&meanSquaredError);
  598. quasiNewtonMethod.setEvaluationGoal(0.0);
  599. quasiNewtonMethod.setGradientNormGoal(0.0001);
  600. quasiNewtonMethod.setMaximumTime(320.0); // seconds
  601. quasiNewtonMethod.setMaximumNumberOfEpochs(20);
  602. quasiNewtonMethod.train();
  603.  
  604.  
  605. quasiNewtonMethod.saveTrainingHistory(trainingHistory);
  606. multilayerPerceptron.save(savedModel);
  607. cout << "neural network saved to: " << savedModel << endl;
  608.  
  609.  
  610. return;
  611. }
  612. void createTrainingFile() {
  613. // READ IN IDENTIFIED PEPTIDES
  614. cout << "Learning from: " << inputFile << endl;
  615. ifstream inputHits;
  616. inputHits.open(inputFile); // // load in the the input file in binary format
  617.  
  618. ofstream tempTrainingSet;
  619. tempTrainingSet.open(tempTrainingfile);
  620.  
  621. ofstream trainingSet; // writes to immediately if we're not using fuzzy outputs (as it doesn't need to be balanced).
  622. trainingSet.open(trainingfile);
  623.  
  624. ofstream testingSet;
  625. testingSet.open(testingfile);
  626.  
  627. ofstream testTrainLog;
  628. testTrainLog.open("testTrain.log");
  629.  
  630. vector<vector<double>> pmasses; // records <mass,error> for all parent masses
  631. vector<vector<double>> fragmasses; // records <mass,error> for all fragment masses
  632.  
  633. vector<string> trainSet; // training set - saved in MultiLayerPerceptron input file
  634. vector<string> testSet; // testing set - saved as a text list of inputs with correct output as last figure.
  635.  
  636. string line ;
  637. int verbosity = 0;
  638. double loaded = 0; // total
  639. double testing = 0; // those used in testing
  640. while (!inputHits.eof()) {
  641. hit h = loadInHit(inputHits); loaded++;
  642. if (h.seq.compare("END")==0) {break;} // no more available (loadInHit returns 'END' at eof)
  643. if (h.Zscore < minZscore) {continue;} // not confident enough hit to use for learning
  644. bool train; // if set 1, used for training, else used for testing
  645. if ((testing/loaded) <= propTesting) {train=0; testing++;} else {train=1;} // if not train, then test
  646. cout << "processing " << h.seq << endl;
  647.  
  648. vector<string> instances = processHit(h,train); // creates attributes for training with this peptide
  649.  
  650. //for (unsigned i=0; i < instances.size(); i++) {
  651. // cout << instances.at(i);
  652. //}
  653.  
  654. if (train==0) { // peptide allocated for testing set - dump to file
  655. for (unsigned i=0; i < instances.size(); i++) {testingSet << instances.at(i);}
  656. testingSet << endl; // peptides separated by newlines.
  657. testing++;
  658. }
  659. if (train==1) { // peptide allocated for testing set - save later writing
  660. for (unsigned i=0; i < instances.size(); i++) {trainSet.push_back(instances.at(i));}
  661. }
  662. //if (processed > 50) {break;}
  663. //break;
  664. }
  665.  
  666. //cout << "training set size: " << trainSet.size(); exit(0);
  667.  
  668. string inputs = "ION_TYPE PCHARGE FCHARGE PLEN PROPLEN PRHKR N_X C_X"; //inputs: iontype, pcharge, fcharge, plen, prop_len, pr_hkr, n_x, c_x
  669. string outputs = "LOG_INTEN_NORM"; // proportion of the total intensity
  670.  
  671. trainingSetHeader(trainingSet, trainSet.size(), inputNodes, outputNodes, inputs, outputs);
  672. for (unsigned i=0; i < trainSet.size(); i++) {trainingSet << trainSet.at(i);}
  673. return;
  674. }
  675. vector<string> processHit(hit h, bool train) { // set train to 1 if it's going to the training set, else it goes to the testing set
  676. vector<Flood::Vector<double>> matrix; // a vector of Flood Style vectors
  677. for (unsigned i=0; i < h.peaks.size(); i++) { // for each peak
  678. fragmatch fm = h.peaks.at(i);
  679. Flood::Vector<double> attributes = getAttributes(h.peaks.at(i),h.seq,h.pcharge);
  680. if (attributes.getSize()==0) {
  681. cout << "no attributes generated for peak: " << i << " with " << fm.fragType << endl;
  682. continue;
  683. } // no identified matching fragment
  684. //cout << "for hit: " << fm.fragType << " " << fm.fragDesc << " " << " int: " << fm.intensity;
  685. //cout << " attributes="; printFloodVector(attributes);
  686. matrix.push_back(attributes);
  687. }
  688.  
  689. // PEAK PROCESSING GOES IN HERE
  690. unsigned sumIntensities=0;
  691. for (unsigned i=0; i < h.peaks.size(); i++) {sumIntensities+=h.peaks.at(i).intensity;}
  692.  
  693. //cout << "sumInt: " << sumIntensities << endl;
  694.  
  695. vector<double> processedIntensities;
  696. for (unsigned i=0; i < h.peaks.size(); i++) { // for each peak
  697. fragmatch fm = h.peaks.at(i);
  698. if (fm.expmz == 0) {continue;}
  699. unsigned intensity = fm.intensity;
  700. //double prop = ((double) intensity / (double) sumIntensities);
  701. double processed = (double) intensity/ (double) sumIntensities;
  702. if (processed>0.2) {processed=0.2;}
  703. processedIntensities.push_back(processed);
  704. }
  705. //exit(0);
  706.  
  707. //processedIntensities = normalise(processedIntensities,100);
  708.  
  709. // convert the matrix to strings of attributes with the processed answer
  710. vector<string> outBlock; // each string corresponds to a line in a training or testing set
  711. for (unsigned i=0; i < matrix.size(); i++) {
  712. Flood::Vector<double> in = matrix.at(i);
  713. string out;
  714. for (int j=0; j < in.getSize(); j++) {
  715. out+=stringify(in[j]);
  716. out+=string(" ");
  717. }
  718. out+=stringify(processedIntensities.at(i));
  719. out+="\n";
  720. outBlock.push_back(out);
  721. }
  722. return outBlock; /// just a list of all hits and misses (with fuzzy outputs).
  723. }
  724. void trainingSetHeader(ofstream& f, int samples, int inputs, int outputs, string inputNames, string outputNames) {
  725. f << "% Testing Dataset for PINNACLE\n";
  726. f << "NumberOfSamples:\n" << samples << endl;
  727. f << "NumberOfInputVariables:\n" << inputs << endl;
  728. f << "NumberOfTargetVariables:\n" << outputs << endl;
  729. f << "NameOfInputVariables:\n" << inputNames << endl;
  730. f << "NameOfTargetVariables:\n" << outputNames << endl;
  731. f << "InputAndTargetData:\n";;
  732. return;
  733. }
  734. string convert2str(vector<double> v) { // adds newline at end
  735. string out;
  736. for (unsigned i=0; i < v.size(); i++) {
  737. char buffer [50]; int n;
  738. n=sprintf_s (buffer, 50, "%4.6f", v.at(i));
  739. out += string(buffer);
  740. if (i<v.size()-1) {out+=" ";} else {out+="\n";}
  741.  
  742. }
  743. //cout << "string out: " << out;
  744.  
  745. return out;
  746.  
  747. }
  748. vector<double> fuzzyOut(float intensity, float maxpeak) { // maxpeak may also be the sumOfPeaks if normalising against total peak intensities.
  749. vector<double> out (outputNodes,0);
  750. double prop = (double) intensity/(double) maxpeak; // the proportional intensity of this peak w.r.t. the maximum in the spect
  751. prop*=1000; // multiplier to make all values greater then zero (since we'll be taking a log)
  752.  
  753. // DIRECT OUTPUT - LOG of the proportion of the sum of intensities this fragment has. Gives us a nice normal curve for learning.
  754. if (intensity==0) {out.at(0)=0;}
  755. else {out.at(0) = log(prop); if (out.at(0) < 0) {out.at(0) =0;}} // don't want negative logs thanks
  756. return out;
  757.  
  758. // FUZZY OUTPUT
  759. //cout << "for proportion: " << prop << endl;
  760. for (unsigned i=0; i < outputNodes; i++) {
  761. double node = i*(1/((double) outputNodes-1)); //cout << "node " << i << " with value " << node << endl; // current node
  762. double diff = fabs(prop-node);
  763. double decay = 3*diff; // decay will be zero when bang on - tells the program how much weight to put on neighbouring nodes.
  764. double value = 1-decay; if (value < 0){value=0;}// cout << "diff: " << diff << " decay: " << decay << " value: " << value << endl;
  765. out.at(i) = value;
  766. }
  767. cout << "fuzzyOut: <"; for (unsigned i=0; i < outputNodes; i++) {cout << out.at(i) << ',';} cout << '>' << endl;
  768.  
  769. return out;
  770. }
  771. fragmatch findMatch(string iontype,unsigned breakSite,hit h) {
  772. // always returns the first such hit. Ie ignores whether is a water loss etc.
  773.  
  774. // basically, scan through the list of matches, and if a hit is found, return it.
  775. //cout << "searching for ion type: " << iontype << " with breakSite at " << breakSite << endl;
  776.  
  777. if (iontype.compare("nofrag")==0) { // parent peptide
  778. for (unsigned i=0; i < h.peaks.size(); i++) {
  779. fragmatch f = h.peaks.at(0);
  780. if (!f.fragType.substr(0,6).compare("nofrag")) {return f;}
  781. }
  782. }
  783. if (iontype.compare("immo")==0) { // immonium fragmatches have a default fragSite of 0.
  784. // here we just need to find a fragment match with the right description.
  785. char aa = h.seq.at(breakSite); // this is the specifc immonium ion we're looking for
  786. //cout << "looking for immo ion: " << aa << endl;
  787. for (unsigned i=0; i < h.peaks.size(); i++) {
  788. fragmatch f = h.peaks.at(i);
  789. //cout << "looking at peak: " << f.fragType << " obsmz: " << f.obsmz << endl;
  790. if (f.obsmz == 0) {continue;} // there is no match here.
  791. if (f.fragType.at(0) == 'i' && f.fragDesc.at(f.fragDesc.size()-1) == aa) {
  792. //cout << "this peak has been found: " << f.fragType << " intensity: " << f.intensity << endl;
  793. return f;
  794. }
  795. }
  796. }
  797. if (iontype.substr(0,2).compare("y1") == 0) {
  798. for (unsigned i=0; i < h.peaks.size(); i++) {
  799. fragmatch f = h.peaks.at(i);
  800. //cout << "\tlooking at frag " << f.fragDesc << " with breakSite at: " << f.fragSite << endl;
  801. if (f.fragSite == breakSite) {
  802. if (!f.fragType.substr(0,2).compare("y1")) {return f;}
  803. }
  804. }
  805. }
  806. if (iontype.substr(0,2).compare("b1") == 0) {
  807. for (unsigned i=0; i < h.peaks.size(); i++) {
  808. fragmatch f = h.peaks.at(i);
  809. //cout << "\tlooking at frag " << f.fragDesc << " with breakSite at: " << f.fragSite << endl;
  810. if (f.fragSite == breakSite) {
  811. if (!f.fragType.substr(0,2).compare("b1")) {return f;}
  812. }
  813. }
  814. }
  815.  
  816. // getting to here means we didn't find a matching peak.
  817. fragmatch nomatch;
  818. nomatch.obsmz = 0; // this how we define a missing peak.
  819. return nomatch;
  820.  
  821. }
  822.  
  823. hit loadInHit(ifstream& file) {
  824. // given the filestream to the output file, loads in a hit. Should be replaced with binary files.
  825. // seq:pcharge:obspmass:exppmass:stdDevs:basicHits:prelimScore:LOD:{mod0:mod1..}:{match1,match2...}:overallLOD\n
  826. // where match = [intensity:obsfmass:expfmass:type:description:LOD]
  827. string line;
  828. hit h;
  829.  
  830. getline( file, line );
  831. if (file.eof()) {h.seq = "END"; return h;}
  832.  
  833. //cout << line;
  834. unsigned i=0; unsigned j=0; while (line.at(j) != ':') {j++;}
  835. h.seq = line.substr(i,j-i); // identified sequence
  836. if (verbosity > 3) {cout << "seq loaded: " << h.seq << endl;}
  837.  
  838.  
  839. i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.pcharge = convert2int(line.substr(i,j-i));
  840. i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.obspmass = convert2double(line.substr(i,j-i));
  841. i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.exppmass = convert2double(line.substr(i,j-i));
  842. i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.Zscore = convert2int(line.substr(i,j-i));
  843. i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.basichits = convert2int(line.substr(i,j-i));
  844. i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.prelimScore = convert2double(line.substr(i,j-i));
  845. i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.overallLOD = convert2double(line.substr(i,j-i));
  846.  
  847. i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.mods = convert2mods(line.substr(i,j-i));
  848. i=j+1; j=j+1; while (line.at(j) != '}') {j++;} h.peaks = convert2peaks(line.substr(i,j-i));
  849.  
  850. //cout << "exiting"; exit(0);
  851. return h;
  852. }
  853.  
  854. vector<double> convert2mods(string token) {
  855. vector<double> mods;
  856. //cout << "converting to mod vector<double>: " << token << endl;
  857. unsigned i=1; unsigned j=2; // first char is '{'
  858. while (j < token.size()) { // for each mod mass
  859. if (token.at(j) != ',' || j < token.size()-2) {j++;}
  860. string tmp = token.substr(i,j-i-1); //cout << "modmass: " << tmp << endl;
  861. double modmass = convert2double(tmp);
  862. mods.push_back(modmass);
  863. i=j; j++;
  864.  
  865. }
  866. return mods;
  867. }
  868.  
  869. vector<fragmatch> convert2peaks(string token) {
  870. vector<fragmatch> matches;
  871. // first break the line into individual matches
  872. vector<string> lines;
  873. unsigned i=0; unsigned j=0;
  874. while (j < token.size() && i < token.size()) {
  875. while (token.at(i) != '[') {i++;}
  876. while (token.at(j) != ']') {j++;}
  877. //cout << "pushing back: " << token.substr(i+1,j-i-1) << endl;
  878. string strfrag = token.substr(i+1,j-i-1); // cout << "parts: " << strfrag << endl;
  879. unsigned k=0;
  880. vector<unsigned> commaPos; // positions of the commas in the string
  881. fragmatch f;
  882. while (k < strfrag.size()) {if (strfrag.at(k) == ',') {commaPos.push_back(k);} k++;}
  883.  
  884. /*cout << token.at(i) << endl;
  885. cout << "commaPos: "; for (unsigned p=0; p < commaPos.size(); p++) {cout << commaPos.at(p) << ',';} cout << endl;
  886. cout << "intensity: " << strfrag.substr(0,commaPos.at(0)) << endl;
  887. cout << "obsMz: " << strfrag.substr(commaPos.at(0)+1,commaPos.at(1)-commaPos.at(0)-1) << endl;
  888. cout << "expMZ: " << strfrag.substr(commaPos.at(1)+1,commaPos.at(2)-commaPos.at(1)-1) << endl;
  889. cout << "fcharge: " << strfrag.substr(commaPos.at(2)+1,commaPos.at(3)-commaPos.at(2)-1) << endl;
  890. cout << "type: " << strfrag.substr(commaPos.at(3)+1,commaPos.at(4)-commaPos.at(3)-1) << endl;
  891. cout << "desc: " << strfrag.substr(commaPos.at(4)+1,commaPos.at(5)-commaPos.at(4)-1) << endl;
  892. cout << "fragLOD: " << strfrag.substr(commaPos.at(5)+1) << endl;*/
  893.  
  894. f.intensity = convert2int(strfrag.substr(0,commaPos.at(0)));
  895. f.obsmz = convert2double(strfrag.substr(commaPos.at(0)+1,commaPos.at(1)-commaPos.at(0)-1));
  896. f.expmz = convert2double(strfrag.substr(commaPos.at(1)+1,commaPos.at(2)-commaPos.at(1)-1));
  897. f.fcharge = convert2int(strfrag.substr(commaPos.at(2)+1,commaPos.at(3)-commaPos.at(2)-1));
  898.  
  899. if (f.expmz == 0.0) {i=j; j++; continue;} // missing peak isn't of any use at the moment.
  900.  
  901. f.fragType = strfrag.substr(commaPos.at(3)+1,commaPos.at(4)-commaPos.at(3)-1);
  902. f.fragDesc = strfrag.substr(commaPos.at(4)+1,commaPos.at(5)-commaPos.at(4)-1);
  903. f.fragLOD = convert2double(strfrag.substr(commaPos.at(5)+1));
  904.  
  905. // calculate the fragmentation site. Zero is the whole peptide and we go from N to C term
  906. if (f.fragType.at(0) == 'i' || f.fragType.at(0) == 'n') {f.fragSite = 0;} // no frag or immonium
  907. else { // x/y/z or a/b/c ion
  908. if (f.fragType.at(0) == 'y') {
  909. unsigned site=1;
  910. while (f.fragDesc.at(site) == '.') {site++;}
  911. f.fragSite = site; //continue;
  912. }
  913. if (f.fragType.at(0) == 'a' || f.fragType.at(0) == 'b') {
  914. unsigned site=1;
  915. while (f.fragDesc.at(site) != '.') {site++;}
  916. f.fragSite = site; //continue;
  917. }
  918. if (f.fragType.at(0) != 'a' && f.fragType.at(0) != 'b' && f.fragType.at(0) != 'y') {
  919. cout << "fragType " << f.fragType << " not yet implemented" << endl;
  920. i=j; j++; continue; // does not record frag site
  921. }
  922.  
  923. }
  924.  
  925. matches.push_back(f); // add this peak
  926.  
  927. i=j; j++; // continue onto next fragment.
  928. }
  929. return matches;
  930. }
  931.  
  932. double convert2double(string str) {
  933. char * cstr;
  934. cstr = new char [str.size()+1]; // convert to c_str
  935. strcpy_s (cstr, str.size()+1, str.c_str());
  936. double value = atof(cstr); //convert to double
  937. return value;
  938. }
  939. int convert2int(string str) {
  940. char * cstr;
  941. cstr = new char [str.size()+1]; // convert to c_str
  942. strcpy_s (cstr, str.size()+1, str.c_str());
  943. int value = atoi(cstr); //convert to int
  944. return value;
  945. }
  946. double encodeAA(char aa) { // encodes AA letter names into numbers
  947. if (aa == 'X') {return 0.00;} // if there is no AA related here, ie N-term side of a b1 ion
  948. if (aa == 'V') {return 0.00;} // hydrophobic side groups
  949. if (aa == 'L') {return 0.05;}
  950. if (aa == 'I') {return 0.10;}
  951. if (aa == 'M') {return 0.15;}
  952. if (aa == 'F') {return 0.20;}
  953. if (aa == 'G') {return 0.25;} // in between
  954. if (aa == 'A') {return 0.30;}
  955. if (aa == 'S') {return 0.35;}
  956. if (aa == 'T') {return 0.40;}
  957. if (aa == 'Y') {return 0.45;}
  958. if (aa == 'W') {return 0.50;}
  959. if (aa == 'C') {return 0.55;}
  960. if (aa == 'P') {return 0.60;}
  961. if (aa == 'N') {return 0.65;} // hydrophilic side groups
  962. if (aa == 'E') {return 0.70;}
  963. if (aa == 'Q') {return 0.75;}
  964. if (aa == 'D') {return 0.80;}
  965. if (aa == 'H') {return 0.85;}
  966. if (aa == 'K') {return 0.90;}
  967. if (aa == 'R') {return 0.95;}
  968. //std::cout << "EncodeAA -> unknown AA letter code: " << aa << std::endl;
  969. return 0.0; // separate value for unknown
  970.  
  971. };
  972. double countHKR(string& seq) { // count the number of H, K or R residues in the input sequence
  973. double HKR=0;
  974. for (unsigned i=0; i<seq.size(); i++) {
  975. char aa = seq.at(i);
  976. if (aa == 'H') {HKR++; continue;}
  977. if (aa == 'K') {HKR++; continue;}
  978. if (aa == 'R') {HKR++; continue;}
  979. }
  980. return HKR;
  981. }
  982. double getMean(vector<double>& v) { // returns the mean
  983. double total=0;
  984. for (unsigned i=0; i < v.size(); i++) {total+= v.at(i);}
  985. return total/v.size();
  986. }
  987. double getStdDev(vector<double>& v, double mean){ // returns the standard deviation of input array, second arg is the precomputed vector mean.
  988. double numerator=0;
  989. for (unsigned i=0; i < v.size(); i++) {numerator+= pow((v.at(i)-mean),2);}
  990. return pow(numerator/v.size(),0.5);
  991. }
  992.  
  993.  
  994. /*
  995. EXAMPLE OF AN OUTPUT FILE WHICH MUST BE PROCESSED AND LEARNED:
  996.  
  997. Search Parameters:
  998. parentTol=1
  999. tol=0.5
  1000. minBasicHits=4
  1001. minPrelimScore=0.05
  1002. randomSamples=25
  1003. randomLODspread=20
  1004. reportMin=5
  1005. topCandidates=0.3
  1006. maxTopCandidates=20
  1007. newmetric=0
  1008. maxFragCharge=2
  1009. verbosity=1
  1010. devsAboveMeanForHit=2
  1011. collectRandomHits=0
  1012. MOD0 = iodacetamide (static) mass: 57 acting on: C
  1013. MOD1 = oxidation (variable) mass: 16 acting on: M H W
  1014. MOD2 = deamidation (variable) mass: 1 acting on: N Q
  1015.  
  1016. MOD0 MOD1 MOD2 Delta
  1017. 0 0 0 0
  1018. 0 1 0 16
  1019. 0 2 0 32
  1020. 1 0 0 57
  1021. 1 1 0 73
  1022. 1 2 0 89
  1023. 2 0 0 114
  1024. 2 1 0 130
  1025. 2 2 0 146
  1026. loading peptide database: Forward Db: 1689106 unique peptides loaded from peptides.dat
  1027. loading peptide database: Reverse Db: 683090 unique peptides loaded from reverse.dat
  1028. DTAFiles\AGTEPSDAIR.mgf.dta
  1029. top ten candidates for pmass: 1015.49 DTAFiles\AGTEPSDAIR.mgf.dta
  1030. AGTEPSDAIR prelim: 0.652732 LOD: 25.9144
  1031. ASAEQSGGGPR prelim: 0.498535 LOD: 23.7549
  1032. ASAVSELSPR prelim: 0.630028 LOD: 23.7549
  1033. TFAHNSALR prelim: 0.559211 LOD: 22.6751
  1034. APQFTILAR prelim: 0.651944 LOD: 22.6751
  1035. ASAAPPHPLR prelim: 0.507099 LOD: 20.5156
  1036. TAEEKVSPR prelim: 0.501239 LOD: 19.4358
  1037. EDSVLNIAR prelim: 0.502028 LOD: 19.4358
  1038. TTSISPALAR prelim: 0.597014 LOD: 19.4358
  1039. TGDPQETLR prelim: 0.523944 LOD: 18.356
  1040. random mean: 9.58833 random StdDev: 2.37195
  1041. std Devs above mean: 6.88297
  1042. IDENTIFIED: AGTEPSDAIR
  1043. observed pmass: 1015.49 pcharge: 1
  1044. identified pmass: 1015.48
  1045. mods: {0,0,0,0,0,0,0,0,0,0,}
  1046. basicHits: 8
  1047. prelimScore: 0.652732
  1048. Top 4N peaks: AGTEPSDAIR
  1049. intensity obsMZ expMZ type seq LOD
  1050.   2265 86.10 86.08 immo I 1.080
  1051.   2049 70.07 70.05 immo P 1.080
  1052.   1672 175.13 175.10 y1+ .........R 2.160
  1053.   1532 359.26 359.15 b1+ AGTE...... 2.160
  1054.   816 658.43 658.26 b1+ AGTEPSD... 2.160
  1055.   751 74.07 74.05 immo T 1.080
  1056.   746 87.09 0.00 0.000
  1057.   576 112.09 0.00 0.000
  1058.   460 69.88 70.05 immo P 1.080
  1059.   453 102.06 102.04 immo E 1.080
  1060.   390 85.84 86.08 immo I 1.080
  1061.   390 110.09 0.00 0.000
  1062.   385 84.07 0.00 0.000
  1063.   382 371.19 0.00 0.000
  1064.   378 300.15 0.00 0.000
  1065.   370 72.09 72.04 b1+ A......... 2.160
  1066.   355 230.15 230.11 b1+ AGT....... 2.160
  1067.   348 159.11 0.00 0.000
  1068.   322 185.12 0.00 0.000
  1069.   321 101.09 101.06 a1+ AG........ 1.080
  1070.   318 641.39 641.33 y1+-1NH3 ....PSDAIR 1.080
  1071.   273 120.09 0.00 0.000
  1072.   264 71.88 72.04 b1+ A......... 2.160
  1073.   251 158.11 158.10 y1+-1NH3 .........R 1.080
  1074.   242 157.13 0.00 0.000
  1075.   229 109.71 0.00 0.000
  1076.   210 136.09 0.00 0.000
  1077.   209 129.11 129.06 b1+ AG........ 2.160
  1078.   200 83.81 0.00 0.000
  1079.   200 119.66 0.00 0.000
  1080.   199 212.13 212.11 b1+-1H2O AGT....... 1.080
  1081.   194 130.11 0.00 0.000
  1082. overall LOD score: 25.9144
  1083.  
  1084. DTAFiles\ALIPSSQNEVELGELLLSLNYLPSAGR.mgf.dta
  1085. loading obs spect: 0
  1086. top ten candidates for pmass: 2882.55 DTAFiles\ALIPSSQNEVELGELLLSLNYLPSAGR.mgf.dta
  1087. ALIPSSQNEVELGELLLSLNYLPSAGR prelim: 0.252294 LOD: 66.5238
  1088. QMNLNPMDSPHSPISPLPPTLSPQPR prelim: 0.10953 LOD: 34.8458
  1089. GETLGNIPLLAIGPAICLPGIAAIALARK prelim: 0.0974537 LOD: 26.9263
  1090. VSSDPAWAVEWIELPRGLSLSSLGSAR prelim: 0.0921176 LOD: 25.3424
  1091. DAAIPAPHAPSAPGCLASVKPGGWKGAAGR prelim: 0.0996068 LOD: 25.3424
  1092. MEDTDFVGWALDVLSPNLISTSMLGR prelim: 0.0916495 LOD: 23.7585
  1093. MALLLLSLGLSLIAAQEFDPHTVMQR prelim: 0.0923048 LOD: 23.7585
  1094. QALTPEQVSFCATHMQQYMDPRGR prelim: 0.0931474 LOD: 23.7585
  1095. NTSIRVADFGSATFDHEHHTTIVATR prelim: 0.0745179 LOD: 22.1746
  1096. DLLEDKILHSHLVDYFPEFDGPQR prelim: 0.107471 LOD: 22.1746
  1097. random mean: 11.9109 random StdDev: 4.05973
  1098. std Devs above mean: 13.4523
  1099. IDENTIFIED: ALIPSSQNEVELGELLLSLNYLPSAGR
  1100. observed pmass: 2882.55 pcharge: 1
  1101. identified pmass: 2882.47
  1102. mods: {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,}
  1103. basicHits: 18
  1104. prelimScore: 0.252294
  1105. Top 4N peaks: ALIPSSQNEVELGELLLSLNYLPSAGR
  1106. intensity obsMZ expMZ type seq LOD
  1107.   3138 95.95 0.00 0.000
  1108.   307 487.20 487.24 y1+ ......................PSAGR 3.168
  1109.   279 86.09 86.08 immo L 1.584
  1110.   253 112.09 0.00 0.000
  1111.   237 763.35 763.38 y1+ ....................YLPSAGR 3.168
  1112.   198 175.09 175.10 y1+ ..........................R 3.168
  1113.   174 600.31 600.32 y1+ .....................LPSAGR 3.168
  1114.   171 70.05 70.05 immo P 1.584
  1115.   155 470.23 470.24 y1+-1NH3 ......................PSAGR 1.584
  1116.   151 1716.85 0.00 0.000
  1117.   141 1416.73 1416.77 y1+ ..............LLLSLNYLPSAGR 3.168
  1118.   141 1715.73 1715.91 y1+ ...........LGELLLSLNYLPSAGR 3.168
  1119.   133 232.10 232.12 y1+ .........................GR 3.168
  1120.   119 400.13 0.00 0.000
  1121.   108 990.44 990.50 y1+ ..................LNYLPSAGR 3.168
  1122.   104 185.09 185.12 b1+ AL......................... 3.168
  1123.   103 542.20 0.00 0.000
  1124.   102 877.33 877.42 y1+ ...................NYLPSAGR 3.168
  1125.   98 303.16 303.16 y1+ ........................AGR 3.168
  1126.   97 1303.62 1303.69 y1+ ...............LLSLNYLPSAGR 3.168
  1127.   96 1943.86 1944.02 y1+ .........VELGELLLSLNYLPSAGR 3.168
  1128.   95 471.21 0.00 0.000
  1129.   95 871.25 0.00 0.000
  1130.   92 1077.48 1077.53 y1+ .................SLNYLPSAGR 3.168
  1131.   80 742.28 0.00 0.000
  1132.   73 298.18 298.20 b1+ ALI........................ 3.168
  1133.   71 643.19 0.00 0.000
  1134.   71 860.35 860.42 y1+-1NH3 ...................NYLPSAGR 1.584
  1135.   66 644.09 644.37 b1+-1NH3-2H2O ALIPSSQ.................... 1.584
  1136.   66 743.40 0.00 0.000
  1137.   63 756.27 0.00 0.000
  1138.   60 187.09 0.00 0.000
  1139.   58 391.19 0.00 0.000
  1140.   58 449.25 0.00 0.000
  1141.   58 2588.10 0.00 0.000
  1142.   57 655.32 0.00 0.000
  1143.   57 1602.76 1602.83 y1+ ............GELLLSLNYLPSAGR 3.168
  1144.   56 497.16 0.00 0.000
  1145.   54 228.08 0.00 0.000
  1146.   54 859.82 859.42 y1+-1H2O ...................NYLPSAGR 1.584
  1147.   53 288.63 0.00 0.000
  1148.   53 289.17 0.00 0.000
  1149.   51 243.16 0.00 0.000
  1150.   51 320.48 0.00 0.000
  1151.   51 409.64 0.00 0.000
  1152.   47 2072.96 2073.06 y1+ ........EVELGELLLSLNYLPSAGR 3.168
  1153.   46 1843.89 0.00 0.000
  1154.   40 2187.00 2187.10 y1+ .......NEVELGELLLSLNYLPSAGR 3.168
  1155.   36 2167.81 0.00 0.000
  1156.   36 2404.79 0.00 0.000
  1157.   1655 96.07 0.00 0.000
  1158.   974 96.18 0.00 0.000
  1159.   958 95.89 0.00 0.000
  1160. overall LOD score: 66.5238
  1161.  
  1162. DTAFiles\AQMDSSQLIHCR.mgf.dta
  1163. top ten candidates for pmass: 1444.43 DTAFiles\AQMDSSQLIHCR.mgf.dta
  1164. FDHNAIMKILDL prelim: 0.272246 LOD: 23.6381
  1165. MAEPGHSHHLSAR prelim: 0.381321 LOD: 21.0117
  1166. MPHLRLVLPITR prelim: 0.299034 LOD: 19.6984
  1167. VMSKSAADIIALAR prelim: 0.295296 LOD: 19.6984
  1168. IEHSTFDGLDRR prelim: 0.361541 LOD: 18.3852
  1169. MESDRLIISLPR prelim: 0.217111 LOD: 17.072
  1170. IDAMHGVMGPYVR prelim: 0.284083 LOD: 17.072
  1171. WQLNPGMYQHR prelim: 0.317101 LOD: 17.072
  1172. GTHPAIHDILALR prelim: 0.294206 LOD: 17.072
  1173. SPSSLLHDPALHR prelim: 0.294206 LOD: 15.7588
  1174. random mean: 10.2432 random StdDev: 3.34294
  1175. std Devs above mean: 4.00694
  1176. IDENTIFIED: FDHNAIMKILDL
  1177. observed pmass: 1444.43 pcharge: 1
  1178. identified pmass: 1444.66
  1179. mods: {0,0,0,0,0,0,16,0,0,0,0,0,}
  1180. basicHits: 5
  1181. prelimScore: 0.272246
  1182. Top 4N peaks: FDHNAIMKILDL
  1183. intensity obsMZ expMZ type seq LOD
  1184.   2157 1000.35 0.00 0.000
  1185.   1812 110.06 110.00 immo H 1.313
  1186.   1360 175.08 0.00 0.000
  1187.   1032 200.09 0.00 0.000
  1188.   922 455.11 0.00 0.000
  1189.   830 86.08 86.08 immo I 1.313
  1190.   648 101.06 101.09 immo K 1.313
  1191.   616 472.15 0.00 0.000
  1192.   552 70.06 0.00 0.000
  1193.   504 335.09 0.00 0.000
  1194.   492 89.06 0.00 0.000
  1195.   383 112.08 0.00 0.000
  1196.   330 585.21 585.18 b1+ FDHNA....... 2.626
  1197.   326 133.02 0.00 0.000
  1198.   298 1289.38 0.00 0.000
  1199.   284 229.11 0.00 0.000
  1200.   279 87.04 87.04 immo N 1.313
  1201.   265 913.36 0.00 0.000
  1202.   263 826.31 0.00 0.000
  1203.   254 104.05 104.04 immo M 1.313
  1204.   240 698.29 698.26 b1+ FDHNAI...... 2.626
  1205.   240 748.23 748.40 y1+ ......MKILDL 2.626
  1206.   239 298.05 0.00 0.000
  1207.   236 334.07 0.00 0.000
  1208.   230 166.04 0.00 0.000
  1209.   228 183.07 0.00 0.000
  1210.   223 247.07 247.11 y1+ ..........DL 2.626
  1211.   221 120.10 120.07 immo F 1.313
  1212.   221 1424.88 0.00 0.000
  1213.   213 1325.48 0.00 0.000
  1214.   210 329.15 0.00 0.000
  1215.   204 731.14 731.40 y1+-1NH3 ......MKILDL 1.313
  1216.   203 1354.55 0.00 0.000
  1217.   201 316.08 0.00 0.000
  1218.   201 1008.42 0.00 0.000
  1219.   199 568.22 568.18 b1+-1NH3 FDHNA....... 1.313
  1220.   196 639.28 0.00 0.000
  1221.   188 366.20 0.00 0.000
  1222.   186 84.04 0.00 0.000
  1223.   184 203.04 0.00 0.000
  1224.   181 331.14 0.00 0.000
  1225.   181 855.49 0.00 0.000
  1226.   177 217.08 0.00 0.000
  1227.   177 791.28 0.00 0.000
  1228.   174 318.11 0.00 0.000
  1229.   170 216.07 0.00 0.000
  1230.   167 861.24 861.48 y1+ .....IMKILDL 2.626
  1231.   165 531.15 0.00 0.000
  1232. overall LOD score: 23.6381
  1233.  
  1234. DTAFiles\ATAAPAGAPPQPQDLEFTK.mgf.dta
  1235. top ten candidates for pmass: 1908.96 DTAFiles\ATAAPAGAPPQPQDLEFTK.mgf.dta
  1236. ATAAPAGAPPQPQDLEFTK prelim: 0.464155 LOD: 54.2672
  1237. KPTNLGTSCLLRHLQR prelim: 0.338948 LOD: 27.1336
  1238. APQPQRSASPALPFWTR prelim: 0.201213 LOD: 24.6669
  1239. STGWYPQPQINWSDSK prelim: 0.355488 LOD: 24.6669
  1240. ALQSAGPFGPYIPGGPAPGR prelim: 0.303074 LOD: 23.4336
  1241. QVSAESSSSMNSNTPLVR prelim: 0.205814 LOD: 23.4336
  1242. GKQSPHHERPEPEPPR prelim: 0.211954 LOD: 22.2002
  1243. ALRPKPTLEKDLEEDR prelim: 0.218912 LOD: 20.9669
  1244. IVEVLGIPPAHILDQAPK prelim: 0.343929 LOD: 20.9669
  1245. GQELHKPETGCQQELR prelim: 0.197297 LOD: 20.9669
  1246. random mean: 12.1361 random StdDev: 4.61053
  1247. std Devs above mean: 9.138
  1248. IDENTIFIED: ATAAPAGAPPQPQDLEFTK
  1249. observed pmass: 1908.96 pcharge: 1
  1250. identified pmass: 1908.95
  1251. mods: {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,}
  1252. basicHits: 15
  1253. prelimScore: 0.464155
  1254. Top 4N peaks: ATAAPAGAPPQPQDLEFTK
  1255. intensity obsMZ expMZ type seq LOD
  1256.   6560 1299.53 1299.63 y1+ ........PPQPQDLEFTK 2.467
  1257.   5095 63.86 0.00 0.000
  1258.   5056 70.06 70.05 immo P 1.233
  1259.   2954 977.39 977.47 y1+ ...........PQDLEFTK 2.467
  1260.   2364 226.10 226.13 b1+-1H2O ATA................ 1.233
  1261.   2318 323.13 0.00 0.000
  1262.   1331 297.12 297.17 b1+-1H2O ATAA............... 1.233
  1263.   1183 315.12 315.17 b1+ ATAA............... 2.467
  1264.   1105 341.14 0.00 0.000
  1265.   987 663.24 0.00 0.000
  1266.   894 244.08 244.13 b1+ ATA................ 2.467
  1267.   879 86.08 86.08 immo L 1.233
  1268.   833 1595.70 1595.78 y1+ ....PAGAPPQPQDLEFTK 2.467
  1269.   828 101.06 101.06 immo Q 1.233
  1270.   701 611.25 611.32 b1+ ATAAPAGA........... 2.467
  1271.   688 141.08 0.00 0.000
  1272.   668 175.07 0.00 0.000
  1273.   661 169.08 0.00 0.000
  1274.   650 1427.57 1427.69 y1+ ......GAPPQPQDLEFTK 2.467
  1275.   613 269.12 0.00 0.000
  1276.   613 548.20 0.00 0.000
  1277.   596 933.36 933.48 b1+ ATAAPAGAPPQ........ 2.467
  1278.   588 905.35 905.48 a1+ ATAAPAGAPPQ........ 1.233
  1279.   572 1281.52 1281.63 y1+-1H2O ........PPQPQDLEFTK 1.233
  1280.   512 368.15 0.00 0.000
  1281.   495 110.06 0.00 0.000
  1282.   480 394.16 394.22 b1+-1H2O ATAAP.............. 1.233
  1283.   480 1202.47 1202.58 y1+ .........PQPQDLEFTK 2.467
  1284.   454 619.24 619.33 y1+-1H2O ..............LEFTK 1.233
  1285.   447 270.13 0.00 0.000
  1286.   439 776.31 0.00 0.000
  1287.   437 1370.58 1370.67 y1+ .......APPQPQDLEFTK 2.467
  1288.   433 98.05 0.00 0.000
  1289.   414 365.16 0.00 0.000
  1290.   403 72.07 72.04 b1+ A.................. 2.467
  1291.   396 337.15 0.00 0.000
  1292.   394 173.07 173.09 b1+ AT................. 2.467
  1293.   386 243.11 0.00 0.000
  1294.   370 126.03 0.00 0.000
  1295.   367 708.29 708.37 b1+ ATAAPAGAP.......... 2.467
  1296.   360 195.09 0.00 0.000
  1297.   350 248.13 248.14 y1+ .................TK 2.467
  1298.   330 112.06 0.00 0.000
  1299.   330 115.07 0.00 0.000
  1300.   329 212.10 0.00 0.000
  1301.   313 540.22 540.28 b1+ ATAAPAG............ 2.467
  1302.   311 209.06 0.00 0.000
  1303.   311 295.13 0.00 0.000
  1304.   310 167.09 0.00 0.000
  1305.   309 906.38 0.00 0.000
  1306.   303 155.07 155.09 b1+-1H2O AT................. 1.233
  1307.   299 382.15 0.00 0.000
  1308.   293 186.09 0.00 0.000
  1309.   278 74.05 74.05 immo T 1.233
  1310.   278 1052.46 0.00 0.000
  1311.   273 583.21 583.32 a1+ ATAAPAGA........... 1.233
  1312.   264 1088.41 1088.53 y1+-1NH3 ..........QPQDLEFTK 1.233
  1313.   262 355.12 0.00 0.000
  1314.   259 102.06 102.04 immo E 1.233
  1315.   245 338.15 0.00 0.000
  1316.   242 453.19 0.00 0.000
  1317.   239 520.24 0.00 0.000
  1318.   237 452.16 0.00 0.000
  1319.   232 198.10 0.00 0.000
  1320. overall LOD score: 54.2672
  1321.  
  1322. */
  1323.  
  1324. Flood::Vector<double> getAttributes(fragmatch& f,string seq, unsigned pcharge) {
  1325.  
  1326. //if (params.verbosity < 5) {cout << "getting attributes for frag " << f.type << " " << f.desc << endl; }
  1327. // given the seq, and the immonium fragment, returns the attribute vector to be fed into the neural net
  1328. vector<double> attributes; // holds the attributes for this fragment
  1329.  
  1330.  
  1331. // NO FRAGMENTATION - TYPE 0
  1332. if (f.fragType.substr(0,6).compare("nofrag")==0) {
  1333. attributes.push_back(0); // iontype 0 = unfragmented
  1334. attributes.push_back(pcharge);
  1335. attributes.push_back(f.fcharge);
  1336. attrib

Report this snippet  

You need to login to post a comment.