
Conversation

@axelwalter (Collaborator) commented Dec 22, 2021

Description

Added a function for pyOpenMS MSExperiment to export peak data to MS1 and MS2 data frames in a format required by MassQL.


Checklist:

  • Make sure that you are listed in the AUTHORS file
  • Add relevant changes and new features to the CHANGELOG file
  • I have commented my code, particularly in hard-to-understand areas
  • New and existing unit tests pass locally with my changes
  • Updated or added python bindings for changed or new classes. (Tick if no updates were necessary.)

How can I get additional information on failed tests during CI:

If your PR is failing, you can check out the CI logs for details.

Note:

  • Once you have opened a PR, try to minimize the number of pushes to it, as every push triggers CI (automated builds and tests) and is rather heavy on our infrastructure (e.g., if several pushes per day are performed).

Advanced commands (admins / reviewer only):

  • /rebase will try to rebase the PR on the current develop branch.
  • /reformat (experimental) applies the clang-format style changes as an additional commit
  • Setting the label NoJenkins will skip tests for this PR on Jenkins (saves resources, e.g., on edits that do not affect tests)

    else:
        ms2_data = ()
    for i, peak in enumerate(spec):
        peak_data = (peak.getIntensity(), peak.getMZ(), scan_num+1, spec.getRT()/60, get_polarity(spec))

Contributor suggested change:

    - peak_data = (peak.getIntensity(), peak.getMZ(), scan_num+1, spec.getRT()/60, get_polarity(spec))
    + peak_data = (peak.getIntensity(), peak.getMZ(), scan_num+1, spec.getRT()/60.0, get_polarity(spec))

Contributor commented:

I think we have a numpy function getPeakData... could this be used to make this faster?

@jpfeuffer (Contributor) commented Dec 22, 2021:

Yes, extract the full numpy arrays for mz and inty, generate equally sized numpy arrays for RT/60, scannum+1, and polarity (e.g. a = np.empty(len(mzvals)); a.fill(RT/60)), concatenate them row-wise, and then transpose. This should be fastest, since everything stays in numpy.

Maybe add two more np.empty rows for i_norm and tic_i_norm, so you only have to fill them later and don't need a costly insert.
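
A minimal sketch of that all-numpy approach, assuming get_polarity is the helper used elsewhere in this PR and an illustrative column order:

    import numpy as np

    def spec_to_arr(spec, scan_num):
        # get_peaks() returns the full (mz, intensity) numpy arrays at once
        mz, inty = spec.get_peaks()
        n = len(mz)
        rt = np.full(n, spec.getRT() / 60.0)   # constant RT column
        scan = np.full(n, scan_num + 1)        # constant scan column
        pol = np.full(n, get_polarity(spec))   # constant polarity column
        i_norm = np.empty(n)                   # reserved, filled in later
        tic_i_norm = np.empty(n)               # reserved, filled in later
        # stack row-wise, then transpose to one row per peak
        return np.stack([inty, i_norm, tic_i_norm, mz, scan, rt, pol]).T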

@axelwalter (Collaborator, Author) commented:

Interesting; this would create the ndarray for the spectrum without numpy.asarray from a list of Python lists. Will try...

@timosachsenberg (Contributor) commented:

I also added @enetz; he has more experience with pandas and numpy, so he might spot some low-hanging performance optimizations.

    def get_spec_arr(spec, scan_num):
        arr = np.asarray([peak_data for peak_data in get_peak_arrays_from_spec(spec, scan_num)], dtype='f')
        arr = np.insert(arr, 1, arr[:,0]/np.amax(arr[:,0]), axis=1)  # i_norm
        arr = np.insert(arr, 2, arr[:,0]/np.sum(arr[:,0]), axis=1)   # tic_i_norm

Contributor commented:

This can be made more readable by creating a pd.DataFrame earlier on. You can then use e.g.
arr['i_norm'] = arr[:, 0]/...
to add new columns, and use pd.concat to stack them later on.
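
A short sketch of that pattern, with column names assumed to match the MassQL format:

    import pandas as pd

    def spec_arr_to_df(arr):
        # arr: one row per peak with columns i, mz, scan, rt, polarity
        df = pd.DataFrame(arr, columns=['i', 'mz', 'scan', 'rt', 'polarity'])
        df['i_norm'] = df['i'] / df['i'].max()      # per-spectrum base peak normalization
        df['i_tic_norm'] = df['i'] / df['i'].sum()  # per-spectrum TIC normalization
        return df

    # later, stack the per-spectrum frames:
    # ms1_df = pd.concat([spec_arr_to_df(a) for a in spec_arrays], ignore_index=True)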

@axelwalter (Collaborator, Author) commented:

Using pandas at this point was significantly slower than numpy; maybe you can check the code I posted below to see if that is what you had in mind?


    def get_peak_arrays_from_spec(spec, scan_num):
        if spec.getMSLevel() == 2:
            ms2_data = (spec.getPrecursors()[0].getMZ(), self.getPrecursorSpectrum(scan_num)+1, spec.getPrecursors()[0].getCharge())

Contributor commented:

Warn if there are multiple precursors? And maybe store the first precursor in a temporary variable one line above. Not sure if Python is smart enough to make that optimization.
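
A possible shape for that change in the context of get_peak_arrays_from_spec above (the warning text is illustrative):

    import warnings

    precursors = spec.getPrecursors()  # fetch once instead of twice
    if len(precursors) > 1:
        warnings.warn(f"Spectrum has {len(precursors)} precursors; using the first one.")
    prec = precursors[0]
    ms2_data = (prec.getMZ(), self.getPrecursorSpectrum(scan_num) + 1, prec.getCharge())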


@jpfeuffer (Contributor) commented:

Cool idea, see my comments for optimizations. In the best case, keep track of run times after every change you make ;)
Mine are mostly theoretical ideas.
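
For example, run times between changes can be tracked with a small script like this (file name hypothetical; the exported function is assumed to be the one added in this PR):

    import time
    from pyopenms import MSExperiment, MzMLFile

    exp = MSExperiment()
    MzMLFile().load("test.mzML", exp)  # hypothetical test file

    start = time.perf_counter()
    ms1_df, ms2_df = exp.get_massql_df()  # export function added in this PR (name assumed)
    print(f"Export took {time.perf_counter() - start:.2f} s")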

@axelwalter (Collaborator, Author) commented:

Thanks for the review! Very interesting. I tried some of the suggestions for optimization, like using pandas earlier, with this code:

    def get_peak_arrays_from_spec(spec, scan_num):
        if spec.getMSLevel() == 2:
            ms2_data = (spec.getPrecursors()[0].getMZ(), self.getPrecursorSpectrum(scan_num)+1, spec.getPrecursors()[0].getCharge())
        else:
            ms2_data = ()
        for peak in spec:
            peak_data = (peak.getIntensity(), peak.getMZ(), scan_num+1, spec.getRT()/60, get_polarity(spec))
            yield peak_data + ms2_data

    def get_spec_df(mslevel, dtypes):
        for scan_num, spec in enumerate(self):
            if spec.getMSLevel() == mslevel:
                df = pd.DataFrame(np.fromiter(get_peak_arrays_from_spec(spec, scan_num), dtype=dtypes))
                # df.insert() is not slower than df['i_norm'] = ... (which would require reordering columns later)
                df.insert(1, 'i_norm', df['i']/df['i'].max()) # i_norm
                df.insert(2, 'i_tic_norm', df['i']/df['i'].sum()) # tic_i_norm
                yield df

    dtypes = [('i', 'float32'), ('mz', 'float64'), ('scan', 'int32'), ('rt', 'float32'), ('polarity', 'int32')]
    if 1 in self.getMSLevels():
        ms1_df = pd.concat(list(get_spec_df(1, dtypes)))
    else:
        ms1_df = pd.DataFrame(columns=[x[0] for x in dtypes])

    dtypes += [('precmz', 'float64'), ('ms1scan', 'int32'), ('charge', 'int32')]
    if 2 in self.getMSLevels():
        ms2_df = pd.concat(list(get_spec_df(2, dtypes)))
    else:
        ms2_df = pd.DataFrame(columns=[x[0] for x in dtypes])

    return ms1_df, ms2_df

However, this was significantly slower (about 10 seconds vs. 2 seconds), so using numpy as much as possible seems to be the way to go. Compared to loading the file and exporting the data frames with the MassQL library itself, this solution is also significantly faster (about 50%).

I added some more docs and changed the code to be more understandable, but was not able to improve the run time compared to before.

But I will still test some more of your suggestions now.

@jpfeuffer (Contributor) commented:

By the way, do I see it correctly that we are basically exporting every single peak in the experiment, only split into an MS1 and an MS2 table?

I think the ultimate solution would then be to add an ms_level parameter to our new getPeakData function in C++ (the one that @timosachsenberg did). I would love that addition anyway, in case someone loads a full experiment but only wants MS1 peaks for plotting. I can take a short look.

@axelwalter (Collaborator, Author) commented:

Yes, that's basically it. Since we calculate the normalized intensities per spectrum, does it make sense to collect the data by iterating over the spectra? Otherwise we would have to split the larger array later for the calculations.
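
For illustration, splitting a flat peak array back into per-spectrum blocks for the normalizations could look like this (column layout assumed from the snippets above: intensity at 0, i_norm at 1, tic_i_norm at 2, scan at 4):

    import numpy as np

    def fill_norm_columns(arr):
        # rows are grouped by spectrum; block boundaries are where the scan column changes
        boundaries = np.flatnonzero(np.diff(arr[:, 4])) + 1
        for block in np.split(arr, boundaries):  # np.split returns views, so writes stick
            block[:, 1] = block[:, 0] / block[:, 0].max()  # i_norm
            block[:, 2] = block[:, 0] / block[:, 0].sum()  # tic_i_norm
        return arr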

@jpfeuffer (Contributor) commented Dec 23, 2021:

Yeah, exactly. I was thinking of just applying that function twice, once for MS1 and once for MS2. I think in this case iteration is faster than memory management between C and Python.
EDIT: I just found out that the current get2DPeakData only exports MS1, due to the usage of AreaIterator. So I will just add another function to get MS2 peaks.

If we assume a fixed number of levels, we could even return a tuple of e.g. nine arrays:
MS1 int, mz, rt; MS2 int, mz, rt; MS3 int, mz, rt.
This should be super fast 😂 but I'm not sure if it's worth it.

@jpfeuffer (Contributor) commented:

#5726

Then you can use getMS2PeakData for MS2 peaks and get2DPeakData (with the full range) for MS1 peaks.
The rest should be straightforward. No need for loops or anything.
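
A hedged sketch of that usage; the exact signatures come from #5726 and are assumed here, not verified against the merged API:

    # exp is a loaded MSExperiment; signatures assumed from the discussion above
    rt1, mz1, int1 = exp.get2DPeakData(exp.getMinRT(), exp.getMaxRT(),
                                       exp.getMinMZ(), exp.getMaxMZ())  # MS1 peaks
    rt2, mz2, int2 = exp.getMS2PeakData()                               # MS2 peaks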

@axelwalter (Collaborator, Author) commented:

In this case you would still need that, right? Because we need to collect the MS2 data (precursor mz, ms1scan, ...) and calculate the normalizations for each spectrum.

@axelwalter (Collaborator, Author) commented:

With the new changes, the computation takes only 0.24 s compared to 2 s with my test file.

@timosachsenberg (Contributor) commented:

I think we might want to add a parameter for the level here: #5726

@timosachsenberg (Contributor) commented:

It would be good to have a test output to track regressions.

@timosachsenberg (Contributor) commented:

Sorry, but the test file is too big (10 MB TSV); it needs to be smaller than 1 MB. One of the BSA fractions (https://github.com/OpenMS/OpenMS/blob/develop/share/OpenMS/examples/FRACTIONS/BSA1_F1.mzML) might still result in a TSV file that is too big.
E.g., https://github.com/OpenMS/OpenMS/blob/develop/src/tests/topp/FileFilter_1_input.mzML could be a better size.

@axelwalter merged commit beeda3b into OpenMS:develop on Dec 28, 2021.
@axelwalter deleted the feature/massql-export branch on Dec 28, 2021, 13:29.