Wingbeat analysis¶

This notebook demonstrates how to analyse wingbeat data output from camfi.

First, load the required libraries.

from math import sqrt
from pathlib import Path

from matplotlib import pyplot as plt
import numpy as np
from scipy import stats

from camfi.projectconfig import CamfiConfig
from camfi.plotting.matplotlib import MatplotlibWingbeatFrequencyPlotter
from camfi.wingbeat import BcesEM, GMM, WeightedGaussian

To run via_project.load_all_exif_metadata and wingbeat_extractor.extract_wingbeats() below, you will first need to download the dataset 2019-11_cabramurra.zip from the Zenodo repository. The link to the repository is here: https://doi.org/10.5281/zenodo.4950570.

If you uncomment the code in the next three code cells, it is assumed you have extracted the images to "data/". Of course you can extract it elsewhere and change root config variable accordingly.

"data/cabramurra_all_annotations.json" already contains all the wingbeat data we need, so in this instance, we don’t need to re-run the time-consuming metadata and wingbeat extraction code. Therefore, we don’t need the image files.

The next cell loads the annotations, and the location and time data used to correct the camera timestamps, which can often be out (either if the camera was set incorrectly, or if it just has an inaccurate clock… which is often the case for cheap cameras).

config_path = "data/cabramurra_config.yml"

config = CamfiConfig.parse_yaml_file(config_path)

# We can print out our config using config.json() or config.yaml()
print(config.json(exclude_unset=True, indent=2))

{
  "root": "data",
  "via_project_file": "data/cabramurra_all_annotations.json",
  "day_zero": "2019-01-01",
  "output_tz": "+10:00",
  "filters": {
    "image_filters": {
      "min_annotations": 1
    }
  },
  "camera": {
    "camera_time_to_actual_time_ratio": 1.0,
    "line_rate": 90500.0
  },
  "time": {
    "camera_placements": {
      "2019-11_cabramurra/0004": {
        "camera_start_time": "2019-10-14T13:00:00+11:00",
        "actual_start_time": "2019-11-14T13:00:00+11:00",
        "location": "cabramurra"
      },
      "2019-11_cabramurra": {
        "camera_start_time": "2019-11-14T13:00:00+11:00",
        "location": "cabramurra"
      }
    }
  },
  "place": {
    "locations": [
      {
        "name": "cabramurra",
        "lat": -35.9507,
        "lon": 148.3972,
        "elevation_m": 1513.9,
        "tz": "+10:00"
      }
    ],
    "weather_stations": [
      {
        "location": {
          "name": "cabramurra_smhea_aws_072161",
          "lat": -35.94,
          "lon": 148.38,
          "elevation_m": 1482.4,
          "tz": "+10:00"
        },
        "data_file": "data/cabramurra_bom_weather_201911.csv"
      }
    ],
    "location_weather_station_mapping": {
      "cabramurra": "cabramurra_smhea_aws_072161"
    }
  },
  "wingbeat_extraction": {
    "device": "cpu",
    "scan_distance": 50
  },
  "annotator": {
    "crop": {
      "x0": 0,
      "y0": 0,
      "x1": 4608,
      "y1": 3312
    },
    "training": {
      "mask_maker": {
        "shape": [
          3312,
          4608
        ],
        "mask_dilate": 5
      },
      "min_annotations": 1,
      "max_annotations": 50,
      "test_set_file": "data/cabramurra_test_set.txt",
      "device": "cuda",
      "batch_size": 5,
      "num_workers": 2,
      "num_epochs": 15,
      "outdir": "data",
      "save_intermediate": true
    },
    "inference": {
      "output_path": "data/cabramurra_autoannotated.json",
      "device": "cuda",
      "backup_device": "cpu",
      "score_thresh": 0.0
    },
    "validation": {
      "autoannotated_via_project_file": "data/cabramurra_autoannotated.json",
      "image_sets": [
        "all",
        "test",
        "train"
      ],
      "output_dir": "data"
    }
  }
}

To get the timestamps for the images, we need to read the EXIF metadata from the image files. Here we also apply time correction. The code is commented out since the metadata has already been loaded into "data/cabramurra_all_annotations.json", but if you are working with a different dataset, or would like to re-run IO intensive this step, uncomment the code.

# Uncomment if exif metadata hasn't been loaded already.
# config.load_all_exif_metadata()

After the EXIF metadata has been loaded, we can run the camfi algorithm to measure the wingbeat frequencies of moths seen in the images. Again, this has already been run and the data is included in "data/cabramurra_all_annotations.json", so only uncomment if you have downloaded the image dataset and want to re-run (or you are running on your own dataset). This step may take a while to run.

Note: This step can be accelerated using a GPU. If you have one on your system, consider setting "wingbeat_extraction":{"device":"cuda","backup_device":"cpu"} in data/cabramurra_config.json.

# Uncomment if wingbeat data hasn't been extracted already
# config.extract_all_wingbeats()

After running the above two steps, you might like to save the results to a new VIA project file. Uncommenting the following will save a new VIA project file to "data/all_annotations_with_wingbeats.json".

# with open("data/all_annotations_with_wingbeats.json", "w") as f:
#     f.write(config.via_project.json(indent=2, exclude_unset=True))

The camfi.datamodel.via.ViaProject class is useful for loading and validating files which are compatible with VIA, however for some analyses it is more convenient to have a Pandas DataFrame. The to_region_dataframe method of camfi.datamodel.via.ViaProject makes this conversion simple.

regions = config.via_project.to_region_dataframe()
regions

	img_key	filename	name	datetime_corrected	datetime_original	exposure_time	location	pixel_x_dimension	pixel_y_dimension	score	best_peak	blur_length	snr	wb_freq_up	wb_freq_down	et_up	et_dn
0	2019-11_cabramurra/0001/DSCF0009.JPG-1	2019-11_cabramurra/0001/DSCF0009.JPG	polyline	2019-11-14 20:20:26+11:00	2019-11-14 20:20:26	0.111111	cabramurra	4608	3456	None	111.0	536.292725	12.966407	44.505436	41.726944	0.107531	0.114691
1	2019-11_cabramurra/0001/DSCF0010.JPG-1	2019-11_cabramurra/0001/DSCF0010.JPG	polyline	2019-11-14 20:30:29+11:00	2019-11-14 20:30:29	0.111111	cabramurra	4608	3456	None	237.0	1008.291016	5.783094	40.686996	35.853527	0.104095	0.118128
2	2019-11_cabramurra/0001/DSCF0010.JPG-1	2019-11_cabramurra/0001/DSCF0010.JPG	polyline	2019-11-14 20:30:29+11:00	2019-11-14 20:30:29	0.111111	cabramurra	4608	3456	None	165.0	675.603577	21.689453	36.698574	36.494766	0.110802	0.111421
3	2019-11_cabramurra/0001/DSCF0010.JPG-1	2019-11_cabramurra/0001/DSCF0010.JPG	polyline	2019-11-14 20:30:29+11:00	2019-11-14 20:30:29	0.111111	cabramurra	4608	3456	None	116.0	660.467407	6.046125	51.069618	50.624634	0.110625	0.111597
4	2019-11_cabramurra/0001/DSCF0010.JPG-1	2019-11_cabramurra/0001/DSCF0010.JPG	circle	2019-11-14 20:30:29+11:00	2019-11-14 20:30:29	0.111111	cabramurra	4608	3456	None	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1414	2019-11_cabramurra/0010/DSCF0747.JPG-1	2019-11_cabramurra/0010/DSCF0747.JPG	point	2019-11-24 23:21:23+11:00	2019-11-24 23:21:23	0.100000	cabramurra	4608	3456	None	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1415	2019-11_cabramurra/0010/DSCF0777.JPG-1	2019-11_cabramurra/0010/DSCF0777.JPG	polyline	2019-11-25 04:22:54+11:00	2019-11-25 04:22:54	0.100000	cabramurra	4608	3456	None	107.0	520.138428	11.467738	48.212074	48.084389	0.099867	0.100133
1416	2019-11_cabramurra/0010/DSCF0779.JPG-1	2019-11_cabramurra/0010/DSCF0779.JPG	polyline	2019-11-25 04:43:00+11:00	2019-11-25 04:43:00	0.100000	cabramurra	4608	3456	None	82.0	389.173492	3.202193	47.306557	46.436455	0.099072	0.100928
1417	2019-11_cabramurra/0010/DSCF0780.JPG-1	2019-11_cabramurra/0010/DSCF0780.JPG	polyline	2019-11-25 04:53:04+11:00	2019-11-25 04:53:04	0.100000	cabramurra	4608	3456	None	129.0	591.514160	5.017477	46.047268	45.040775	0.098895	0.101105
1418	2019-11_cabramurra/0010/DSCF0851.JPG-1	2019-11_cabramurra/0010/DSCF0851.JPG	point	2019-11-26 04:43:00+11:00	2019-11-26 04:43:00	0.100000	cabramurra	4608	3456	None	NaN	NaN	NaN	NaN	NaN	NaN	NaN

1419 rows × 17 columns

Next we want proceed with just polyline annotations (as these are the only type which enable wingbeat measurement), and we also want to filter the data by an SNR threshold.

snr_thresh = 4.0
polyline_regions = regions[regions["name"] == "polyline"]
above_thresh = polyline_regions[polyline_regions["snr"] >= snr_thresh]
len(above_thresh)

We can already visualise the data, with the SNR threshold indicated by a red line.

plotter = MatplotlibWingbeatFrequencyPlotter(
    polyline_regions=polyline_regions,
    snr_thresh=snr_thresh,
)
fig = plotter.plot()

../../_images/wingbeat_analysis_15_0.png

With zero bodylength assumption¶

Based on the above plots, we now wish to select a number of target classes. This will be used for fitting a Gaussian mixture-model to the preliminary winbeat frequencies (which assume zero bodylength), and for classification in the linear regression step for the unbiased estimate of wingbeat frequency.

n_classes = 2

Operating on the preliminary wingbeat data (figure a), we can fit a Gaussian mixture-model.

gmm = GMM.log10_from_region_dataframe(
    above_thresh,
    n_classes,
    seed=1234567890  # Not reuired to set, but makes the results
                     # reproducible
)
gmm_results = sorted(gmm.fit())  # Order of classes is random, so we sort
                                 # to make it predictable.

print("log10 Gaussian Mixture Model parameters:")
print("\n".join(str(r) for r in gmm_results))

log10 Gaussian Mixture Model parameters:
mean=1.3968704095248994 std=0.0755076055338635 weight=0.14025772662404282
mean=1.6913164856912772 std=0.07106824294152196 weight=0.8597422733759577

To get mean and standard deviation in Hz, we need to re-parametrise.

for weighted_gaussian in gmm_results:
    mean_hz, std_hz = weighted_gaussian.get_natural_params_from_log(10.0)
    print(f"μ = {mean_hz} Hz  σ = {std_hz}  weight = {weighted_gaussian.weight}")

μ = 25.102738892471677 Hz  σ = 2.8856693084544567  weight = 0.14025772662404282
μ = 49.41306955900124 Hz  σ = 5.34427937939624  weight = 0.8597422733759577

The estimate of standard deviation includes the measurement error associated with exposure time ambiguity. Here we will attempt to make an estimate of the underlying wingbeat frequency standard deviation.

First, we need to check that measurement error is not correlated with log10(wingbeat frequency)

log10_means = (np.log10(above_thresh.wb_freq_down) + np.log10(above_thresh.wb_freq_up)) / 2
log10_errors = log10_means - np.log10(above_thresh.wb_freq_down)

densities = np.stack([stats.norm.pdf(log10_means, loc=r.mean, scale=r.std) for r in gmm_results])
gmm_classification_prediction = densities.argmax(axis=0)

fig = plt.figure()
ax = fig.add_subplot(
    111,
    xlabel="log10(wingbeat frequency)",
    ylabel="Direction ambiguity measurement error",
)
ax.scatter(
    log10_means,
    log10_errors,
    marker="x",
    c=np.array(["tab:blue", "tab:green"])[gmm_classification_prediction],
)

for i in range(n_classes):
    class_mask = gmm_classification_prediction == i
    r, p = stats.pearsonr(log10_means[class_mask], log10_errors[class_mask])
    print(f"class {i}:  r = {r}  p(|R| >= |r|) = {p}")

class 0:  r = 0.1120666137470634  p(|R| >= |r|) = 0.3042812058830173
class 1:  r = 0.10711341709033767  p(|R| >= |r|) = 0.017240476736024066

../../_images/wingbeat_analysis_25_1.png

It appears that measurement error is not correlated to measurement value, so we can proceed using the relation

\[Var(X + Y) = Var(X) + Var(Y)\]

log10_wingbeat_gaussians = []

for i in range(n_classes):
    class_mask = gmm_classification_prediction == i
    error_var = np.var(log10_errors[class_mask], ddof=1)
    log10_wingbeat_var = gmm_results[i].std ** 2 - error_var
    log10_wingbeat_gaussians.append(
        WeightedGaussian(
            mean=gmm_results[i].mean,
            std=np.sqrt(log10_wingbeat_var),
            weight=gmm_results[i].weight,
        )
    )

for weighted_gaussian in log10_wingbeat_gaussians:
    mean_hz, std_hz = weighted_gaussian.get_natural_params_from_log(10.0)
    print(f"μ = {mean_hz} Hz  σ = {std_hz}  weight = {weighted_gaussian.weight}")

μ = 25.101781030581876 Hz  σ = 2.8771046444652533  weight = 0.14025772662404282
μ = 49.403384640368735 Hz  σ = 5.25188370862169  weight = 0.8597422733759577

We can set the gmm_results parameter to plot the figure with the Gaussian mixture model shown.

plotter = MatplotlibWingbeatFrequencyPlotter(
    polyline_regions=polyline_regions,
    snr_thresh=snr_thresh,
    gmm_results=gmm_results,
)
fig = plotter.plot()

../../_images/wingbeat_analysis_29_0.png

Without zero bodylength assumption¶

Now we use an EM algorithm to classify the data using BCES regressions of \(L\) vs. \(P \Delta t\).

bces_em = BcesEM.from_region_dataframe(
    above_thresh, n_classes, seed=1234567890
)
bces_results = bces_em.fit()

The order of the classes is random, so we sort to make them predictable (and line up better with the sorted classes from the GMM). This will make colouring comparable across the two models. Unfortunately we can’t just sort bces_results, because then the colours of the scatter plot wouldn’t be guaranteed to match the regression lines. We have to also re-map the bces_em.class_mask values.

This can do this re-mapping with some indexing trickery using np.argsort, and then we can sort bces_results the normal way.

inverse_index = np.argsort(np.argsort(bces_results))
class_mask = inverse_index[bces_em.class_mask]
class_n = np.bincount(class_mask)
bces_results = sorted(bces_results)

print("Classifications:")
print("\n".join(str(n) for n in class_n))
print("Multiple BCES linear regression parameters:")
print("\n".join(str(b) for b in bces_results))

Classifications:
75
505
Multiple BCES linear regression parameters:
gradient=23.684945219639722 y_intercept=21.164640791557133 gradient_stderr=1.8436575083709323 y_intercept_stderr=30.356672106881156 cov_xy=-54.37303718722761
gradient=48.648490469072115 y_intercept=30.410128999818426 gradient_stderr=1.4469572965456388 y_intercept_stderr=18.927271182452444 cov_xy=-26.431054889875146

Testing zero bodylength assumption¶

We can test the zero body-length assumption by calculating the t-statistic of the intercept of the linear regression

\[t = \frac{\hat{\beta_0} - 0}{SE(\hat{\beta_0})}\]

where \(\hat{\beta_0}\) is the estimate of the intercept. We then can test the following hypotheses:

\[H_0 : \quad \hat{\beta_0} = 0 ,\]

\[H_1 : \quad \hat{\beta_0} > 0 .\]

We perform a one-sided test, since negative bodylength does not make physical sense (so there is no need for a two-sided test). Note that this is also more conservative, since a higher-powered test is more likely to reject \(H_0\), which would force us to reject the zero body-length assumption.

t_values = np.array([estimates.y_intercept / estimates.y_intercept_stderr for estimates in bces_results])
dfs = class_n - 2  # Two estimators per linear regression
p_values = stats.t.sf(t_values, dfs)
print(t_values)
print(p_values)

[0.69719898 1.60668322]
[0.24394638 0.05437571]

We see that for both regressions, we have insufficient evidence to reject the zero body-length assumption (\(p > 0.05\)). It is therefore reasonable to use the estimates given by the GMM approach.

for weighted_gaussian in log10_wingbeat_gaussians:
    mean_hz, std_hz = weighted_gaussian.get_natural_params_from_log(10.0)
    print(f"μ = {mean_hz} Hz  σ = {std_hz}  weight = {weighted_gaussian.weight}")

μ = 25.101781030581876 Hz  σ = 2.8771046444652533  weight = 0.14025772662404282
μ = 49.403384640368735 Hz  σ = 5.25188370862169  weight = 0.8597422733759577

Finally, we reproduce the figure from the publication, which includes both the GMM and EM classification

plotter = MatplotlibWingbeatFrequencyPlotter(
    polyline_regions=polyline_regions,
    snr_thresh=snr_thresh,          # Must be the same threshold used to
                                    # make the class_mask.
    class_mask=class_mask,          # Adds colour to the measurements
    gmm_results=gmm_results,        # Include the GMM plots
    bces_results=bces_results,      # Include the BCES regression lines
)
fig = plotter.plot()

../../_images/wingbeat_analysis_40_0.png

fig is just a matplotlib Figure instance, so we can save it quite easily.

fig.savefig("wingbeat_frequency_figure.pdf", dpi=600.0, pad_inches=0.0)