
Conversation

@axelwalter (Collaborator) commented Dec 22, 2021

Description

Added a function for pyOpenMS MSExperiment to export peak data to MS1 and MS2 data frames in a format required by MassQL.


Checklist:

  • Make sure that you are listed in the AUTHORS file
  • Add relevant changes and new features to the CHANGELOG file
  • I have commented my code, particularly in hard-to-understand areas
  • New and existing unit tests pass locally with my changes
  • Updated or added python bindings for changed or new classes. (Tick if no updates were necessary.)

How can I get additional information on failed tests during CI:

If your PR is failing, you can check out the CI logs for details.

Note:

  • Once you have opened a PR, try to minimize the number of pushes to it, as every push triggers CI (automated builds and tests) and is rather heavy on our infrastructure (e.g., if several pushes per day are performed).

Advanced commands (admins / reviewer only):

  • /rebase will try to rebase the PR on the current develop branch.
  • /reformat (experimental) applies the clang-format style changes as an additional commit
  • Setting the label NoJenkins will skip tests for this PR on Jenkins (saves resources, e.g., on edits that do not affect tests)

    else:
        ms2_data = ()
    for i, peak in enumerate(spec):
        peak_data = (peak.getIntensity(), peak.getMZ(), scan_num+1, spec.getRT()/60, get_polarity(spec))

Contributor suggested change:

    - peak_data = (peak.getIntensity(), peak.getMZ(), scan_num+1, spec.getRT()/60, get_polarity(spec))
    + peak_data = (peak.getIntensity(), peak.getMZ(), scan_num+1, spec.getRT()/60.0, get_polarity(spec))

Contributor commented:

I think we have a numpy function getPeakData... could this be used to make this faster?

@jpfeuffer (Contributor) commented Dec 22, 2021:

Yes, extract the full numpy arrays for mz and inty, generate equally sized numpy arrays for RT/60, scannum+1, and polarity (e.g. a = np.empty(len(mzvals)); a.fill(RT/60)), concatenate them row-wise, and then transpose. This should be fastest, since everything stays in numpy.

Maybe add two more np.empty rows for i_norm and tic_i_norm, so you only have to fill them later and don't need a costly insert.
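
A minimal sketch of that all-numpy approach, assuming get_polarity is the helper used elsewhere in this PR and an illustrative column order:

    import numpy as np

    def spec_to_arr(spec, scan_num):
        # get_peaks() returns the full (mz, intensity) numpy arrays at once
        mz, inty = spec.get_peaks()
        n = len(mz)
        rt = np.full(n, spec.getRT() / 60.0)   # constant RT column
        scan = np.full(n, scan_num + 1)        # constant scan column
        pol = np.full(n, get_polarity(spec))   # constant polarity column
        i_norm = np.empty(n)                   # reserved, filled in later
        tic_i_norm = np.empty(n)               # reserved, filled in later
        # stack row-wise, then transpose to one row per peak
        return np.stack([inty, i_norm, tic_i_norm, mz, scan, rt, pol]).T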

@axelwalter (Collaborator, Author) commented:

Interesting; this would create the ndarray for the spectrum without numpy.asarray from a list of Python lists. Will try...

@timosachsenberg (Contributor) commented:

I also added @enetz; he has more experience with pandas and numpy, so he might spot some low-hanging performance optimizations.

    def get_spec_arr(spec, scan_num):
        arr = np.asarray([peak_data for peak_data in get_peak_arrays_from_spec(spec, scan_num)], dtype='f')
        arr = np.insert(arr, 1, arr[:,0]/np.amax(arr[:,0]), axis=1)  # i_norm
        arr = np.insert(arr, 2, arr[:,0]/np.sum(arr[:,0]), axis=1)   # tic_i_norm

Contributor commented:

This can be made more readable by creating a pd.DataFrame earlier on. You can then use e.g.
arr['i_norm'] = arr[:, 0]/...
to add new columns, and use pd.concat to stack them later on.
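
A short sketch of that pattern, with column names assumed to match the MassQL format:

    import pandas as pd

    def spec_arr_to_df(arr):
        # arr: one row per peak with columns i, mz, scan, rt, polarity
        df = pd.DataFrame(arr, columns=['i', 'mz', 'scan', 'rt', 'polarity'])
        df['i_norm'] = df['i'] / df['i'].max()      # per-spectrum base peak normalization
        df['i_tic_norm'] = df['i'] / df['i'].sum()  # per-spectrum TIC normalization
        return df

    # later, stack the per-spectrum frames:
    # ms1_df = pd.concat([spec_arr_to_df(a) for a in spec_arrays], ignore_index=True)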

@axelwalter (Collaborator, Author) commented:

Using pandas at this point was significantly slower than numpy; maybe you can check the code I posted below to see if that is what you had in mind?


    def get_peak_arrays_from_spec(spec, scan_num):
        if spec.getMSLevel() == 2:
            ms2_data = (spec.getPrecursors()[0].getMZ(), self.getPrecursorSpectrum(scan_num)+1, spec.getPrecursors()[0].getCharge())

Contributor commented:

Warn if there are multiple precursors? And maybe store the first precursor in a temporary variable one line above. Not sure if Python is smart enough to make that optimization.
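
A possible shape for that change in the context of get_peak_arrays_from_spec above (the warning text is illustrative):

    import warnings

    precursors = spec.getPrecursors()  # fetch once instead of twice
    if len(precursors) > 1:
        warnings.warn(f"Spectrum has {len(precursors)} precursors; using the first one.")
    prec = precursors[0]
    ms2_data = (prec.getMZ(), self.getPrecursorSpectrum(scan_num) + 1, prec.getCharge())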


@jpfeuffer (Contributor) commented:

Cool idea, see my comments for optimizations. In the best case, keep track of run times after every change you make ;)
Mine are mostly theoretical ideas.
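
For example, run times between changes can be tracked with a small script like this (file name hypothetical; the exported function is assumed to be the one added in this PR):

    import time
    from pyopenms import MSExperiment, MzMLFile

    exp = MSExperiment()
    MzMLFile().load("test.mzML", exp)  # hypothetical test file

    start = time.perf_counter()
    ms1_df, ms2_df = exp.get_massql_df()  # export function added in this PR (name assumed)
    print(f"Export took {time.perf_counter() - start:.2f} s")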

@axelwalter (Collaborator, Author) commented:

Thanks for the review! Very interesting. I tried some of the suggestions for optimization, like using pandas earlier, with this code:

    def get_peak_arrays_from_spec(spec, scan_num):
        if spec.getMSLevel() == 2:
            ms2_data = (spec.getPrecursors()[0].getMZ(), self.getPrecursorSpectrum(scan_num)+1, spec.getPrecursors()[0].getCharge())
        else:
            ms2_data = ()
        for peak in spec:
            peak_data = (peak.getIntensity(), peak.getMZ(), scan_num+1, spec.getRT()/60, get_polarity(spec))
            yield peak_data + ms2_data

    def get_spec_df(mslevel, dtypes):
        for scan_num, spec in enumerate(self):
            if spec.getMSLevel() == mslevel:
                df = pd.DataFrame(np.fromiter(get_peak_arrays_from_spec(spec, scan_num), dtype=dtypes))
                # df.insert() is not slower than df['i_norm'] = ... (which would require reordering columns later)
                df.insert(1, 'i_norm', df['i']/df['i'].max()) # i_norm
                df.insert(2, 'i_tic_norm', df['i']/df['i'].sum()) # tic_i_norm
                yield df

    dtypes = [('i', 'float32'), ('mz', 'float64'), ('scan', 'int32'), ('rt', 'float32'), ('polarity', 'int32')]
    if 1 in self.getMSLevels():
        ms1_df = pd.concat(list(get_spec_df(1, dtypes)))
    else:
        ms1_df = pd.DataFrame(columns=[x[0] for x in dtypes])

    dtypes += [('precmz', 'float64'), ('ms1scan', 'int32'), ('charge', 'int32')]
    if 2 in self.getMSLevels():
        ms2_df = pd.concat(list(get_spec_df(2, dtypes)))
    else:
        ms2_df = pd.DataFrame(columns=[x[0] for x in dtypes])

    return ms1_df, ms2_df

However, this was significantly slower (about 10 seconds vs. 2 seconds), so using numpy as much as possible seems to be the way to go. Compared to loading the file and exporting the data frames with the MassQL library itself, this solution is also significantly faster (about 50%).

I added some more docs and changed the code to be more understandable, but was not able to improve the run time compared to before.

But I will still test some more of your suggestions now.

@jpfeuffer (Contributor) commented:

By the way, do I see it correctly that we are basically exporting every single peak in the experiment, only split into an MS1 and an MS2 table?

I think the ultimate solution would then be to add an ms_level parameter to our new getPeakData function in C++ (the one that @timosachsenberg did). I would love that addition anyway, in case someone loads a full experiment but only wants MS1 peaks for plotting. I can take a short look.

@axelwalter (Collaborator, Author) commented:

Yes, that's basically it. Since we calculate the normalized intensities per spectrum, does it make sense to collect the data by iterating over the spectra? Otherwise we would have to split the larger array later for the calculations.
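
For illustration, splitting a flat peak array back into per-spectrum blocks for the normalizations could look like this (column layout assumed from the snippets above: intensity at 0, i_norm at 1, tic_i_norm at 2, scan at 4):

    import numpy as np

    def fill_norm_columns(arr):
        # rows are grouped by spectrum; block boundaries are where the scan column changes
        boundaries = np.flatnonzero(np.diff(arr[:, 4])) + 1
        for block in np.split(arr, boundaries):  # np.split returns views, so writes stick
            block[:, 1] = block[:, 0] / block[:, 0].max()  # i_norm
            block[:, 2] = block[:, 0] / block[:, 0].sum()  # tic_i_norm
        return arr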

@jpfeuffer (Contributor) commented Dec 23, 2021:

Yeah, exactly. I was thinking of just applying that function twice, once for MS1 and once for MS2. I think in this case iteration is faster than memory management between C and Python.
EDIT: I just found out that the current get2DPeakData only exports MS1, due to the usage of AreaIterator. So I will just add another function to get MS2 peaks.

If we assume a fixed number of levels, we could even return a tuple of e.g. nine arrays:
MS1 int, mz, rt; MS2 int, mz, rt; MS3 int, mz, rt.
This should be super fast 😂 but I'm not sure if it's worth it.

@jpfeuffer (Contributor) commented:

#5726

Then you can use getMS2PeakData for MS2 peaks and get2DPeakData (with the full range) for MS1 peaks.
The rest should be straightforward. No need for loops or anything.
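
A hedged sketch of that usage; the exact signatures come from #5726 and are assumed here, not verified against the merged API:

    # exp is a loaded MSExperiment; signatures assumed from the discussion above
    rt1, mz1, int1 = exp.get2DPeakData(exp.getMinRT(), exp.getMaxRT(),
                                       exp.getMinMZ(), exp.getMaxMZ())  # MS1 peaks
    rt2, mz2, int2 = exp.getMS2PeakData()                               # MS2 peaks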

@axelwalter (Collaborator, Author) commented:

In this case you would still need that, right? Because we need to collect the MS2 data (precursor mz, ms1scan, ...) and calculate the normalizations for each spectrum.

@axelwalter (Collaborator, Author) commented:

With the new changes, the computation takes only 0.24 s compared to 2 s with my test file.

@timosachsenberg (Contributor) commented:

I think we might want to add a parameter for the level here: #5726

@timosachsenberg (Contributor) commented:

It would be good to have a test output to track regressions.

@timosachsenberg (Contributor) commented:

Sorry, but the test file is too big (10 MB TSV); it needs to be smaller than 1 MB. One of the BSA fractions (https://github.com/OpenMS/OpenMS/blob/develop/share/OpenMS/examples/FRACTIONS/BSA1_F1.mzML) might still result in a TSV file that is too big.
E.g., https://github.com/OpenMS/OpenMS/blob/develop/src/tests/topp/FileFilter_1_input.mzML could be a better size.

@axelwalter merged commit beeda3b into OpenMS:develop on Dec 28, 2021.
@axelwalter deleted the feature/massql-export branch on Dec 28, 2021, 13:29.