[Feature] export MSExperiment to data frames for MassQL #5722
Conversation
src/pyOpenMS/pyopenms/dataframes.py
Outdated
```python
else:
    ms2_data = ()
for i, peak in enumerate(spec):
    peak_data = (peak.getIntensity(), peak.getMZ(), scan_num+1, spec.getRT()/60, get_polarity(spec))
```
Suggested change:

```diff
-peak_data = (peak.getIntensity(), peak.getMZ(), scan_num+1, spec.getRT()/60, get_polarity(spec))
+peak_data = (peak.getIntensity(), peak.getMZ(), scan_num+1, spec.getRT()/60.0, get_polarity(spec))
```
I think we have a numpy function getPeakData... could this be used to make this faster?
Yes: extract the full numpy arrays for m/z and intensity, generate equally sized numpy arrays for RT/60, scan_num+1 and polarity (e.g. `a = np.empty(len(mzvals)); a.fill(RT/60)`), concatenate row-wise and then transpose. This should be fastest, since everything stays in numpy.
Maybe add two more `np.empty` rows for `i_norm` and `tic_i_norm`, so you just have to fill them later and don't need the costly `np.insert`.
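A minimal sketch of the row-wise numpy approach described above. The column order and the helper name are illustrative, not the final API; here the normalization columns are filled immediately instead of being preallocated and filled later:

```python
import numpy as np

def build_spec_array(mz, inty, rt, scan_num, polarity):
    # mz/inty: numpy arrays for one spectrum (e.g. from spec.get_peaks() in pyOpenMS)
    n = len(mz)
    rt_col = np.empty(n); rt_col.fill(rt / 60.0)         # RT in minutes
    scan_col = np.empty(n); scan_col.fill(scan_num + 1)  # 1-based scan number
    pol_col = np.empty(n); pol_col.fill(polarity)
    # normalization columns, computed directly on the intensity array
    i_norm = inty / inty.max()
    tic_i_norm = inty / inty.sum()
    # concatenate row-wise, then transpose to get one row per peak
    return np.vstack((inty, i_norm, tic_i_norm, mz, scan_col, rt_col, pol_col)).T
```

Since every column is built as a full numpy array, no per-peak Python loop and no `np.insert` copy is needed.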
Interesting, this would create the ndarray for the spectrum without `numpy.asarray` from a list of Python lists. Will try...
I also added @enetz; he has more experience with pandas and numpy, so he might spot some low-hanging performance optimizations.
src/pyOpenMS/pyopenms/dataframes.py
Outdated
```python
def get_spec_arr(spec, scan_num):
    arr = np.asarray([peak_data for peak_data in get_peak_arrays_from_spec(spec, scan_num)], dtype='f')
    arr = np.insert(arr, 1, arr[:,0]/np.amax(arr[:,0]), axis=1)  # i_norm
    arr = np.insert(arr, 2, arr[:,0]/np.sum(arr[:,0]), axis=1)   # tic_i_norm
```
This can be made more readable by making a pd.DataFrame earlier on. You can then use e.g.
`arr['i_norm'] = arr[:, 0] / ...`
to add new columns, and use `pd.concat` to stack them later on.
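A sketch of that suggestion, assuming column 0 of the per-spectrum array holds the raw intensities (the layout and helper name are illustrative):

```python
import numpy as np
import pandas as pd

def spec_arr_to_df(arr):
    # arr: one row per peak; column 0 assumed to be the raw intensity
    df = pd.DataFrame(arr)
    df['i_norm'] = df[0] / df[0].max()      # base-peak normalised intensity
    df['tic_i_norm'] = df[0] / df[0].sum()  # TIC-normalised intensity
    return df

# stacking the per-spectrum frames later:
# full_df = pd.concat(per_spec_dfs, ignore_index=True)
```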
Using pandas at this point was significantly slower than numpy; maybe you can check the code I posted below to see if that is what you had in mind?
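For reference, each `np.insert` call with `axis=1` copies the whole array to add one column, which is part of what makes it costly. A tiny standalone demonstration of what the two normalization inserts shown earlier do (the two-column input layout is illustrative):

```python
import numpy as np

arr = np.array([[10.0, 100.0],
                [40.0, 200.0]])  # columns: intensity, m/z
# insert i_norm (intensity / base-peak intensity) as column 1
arr = np.insert(arr, 1, arr[:, 0] / np.amax(arr[:, 0]), axis=1)
# insert tic_i_norm (intensity / total ion current) as column 2
arr = np.insert(arr, 2, arr[:, 0] / np.sum(arr[:, 0]), axis=1)
```

After both inserts each row reads: intensity, i_norm, tic_i_norm, m/z.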
src/pyOpenMS/pyopenms/dataframes.py
Outdated
```python
def get_peak_arrays_from_spec(spec, scan_num):
    if spec.getMSLevel() == 2:
        ms2_data = (spec.getPrecursors()[0].getMZ(), self.getPrecursorSpectrum(scan_num)+1, spec.getPrecursors()[0].getCharge())
```
Warn if there are multiple precursors? And maybe store the first precursor in a temporary one line above. Not sure if Python is smart enough to make that optimization.
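One way to do both, sketched with a stand-in `Precursor` class so the snippet is self-contained (the real objects come from pyOpenMS, and the helper name is hypothetical):

```python
import warnings

class Precursor:  # stand-in for the pyOpenMS precursor object
    def __init__(self, mz, charge):
        self._mz, self._charge = mz, charge
    def getMZ(self):
        return self._mz
    def getCharge(self):
        return self._charge

def first_precursor(precursors):
    # warn on ambiguity, and fetch the first precursor only once
    if len(precursors) > 1:
        warnings.warn("spectrum has multiple precursors; using the first one")
    prec = precursors[0]
    return (prec.getMZ(), prec.getCharge())
```

Storing `prec` once also avoids calling `getPrecursors()[0]` twice per spectrum.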
Cool idea, see my comments for optimization. In the best case, keep track of timings after every change you make ;)
Thanks for the review! Very interesting. I tried some of the suggestions for optimization, like using pandas earlier. However, this was significantly slower (about 10 seconds vs. 2 seconds), so using numpy as much as possible seems to be working fine. Compared to file loading and exporting of data frames by the MassQL library, this solution is also significantly faster (about 50%). I added some more docs and changed the code to be more understandable, but was not able to get a better run time compared to before. I will still test some more of your suggestions now.
By the way, do I see it correctly that we are basically exporting every single peak in the experiment, only split into an MS1 and an MS2 table? I think then the ultimate solution would be to add an MS-level parameter to our new getPeakData function in C++ (the one that @timosachsenberg did). I would love that addition anyway, in case someone loads a full experiment but only wants MS1 peaks for plotting. I can take a short look.
Yes, that's basically it. Since we calculate the normalized intensities per spectrum, does it make sense to collect the data by iterating over the spectra? Otherwise we would have to split the larger array later for the calculations.
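If the data were collected into one large table first, the per-spectrum normalization could still be done afterwards with a grouped transform rather than manual splitting (column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'scan': [1, 1, 2, 2],
    'i':    [10.0, 30.0, 5.0, 20.0],
})
# normalize each peak against the base peak / TIC of its own spectrum
df['i_norm'] = df.groupby('scan')['i'].transform(lambda x: x / x.max())
df['tic_i_norm'] = df.groupby('scan')['i'].transform(lambda x: x / x.sum())
```

Whether this beats per-spectrum iteration would need timing on real data, as the earlier pandas experiments in this thread suggest.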
Yeah, exactly. I was thinking of just applying that function twice, once for MS1 and once for MS2. I think in this case iteration is faster than memory management between C and Python. If we assume a fixed number of levels, we could even return a tuple of, e.g., nine arrays.
Then you can use getMS2PeakData for MS2 peaks and get2DPeakData (with the full range) for MS1 peaks.
In this case you would still need that, right? Because we need to collect the MS2 data (precursor m/z, MS1 scan, ...) and calculate normalizations for each spectrum.
With the new changes the computation takes only 0.24 s, compared to 2 s with my test file.
I think we might want to add a parameter for the level here: #5726
It would be good to have a test output to track regressions.
Sorry, but the test file is too big (10 MB TSV); it needs to be smaller than 1 MB. E.g., one of the BSA fractions might still result in a too-big TSV file (https://github.com/OpenMS/OpenMS/blob/develop/share/OpenMS/examples/FRACTIONS/BSA1_F1.mzML).
Description
Added a function for pyOpenMS MSExperiment to export peak data to MS1 and MS2 data frames in a format required by MassQL.