Distribution-fitting¶
-
incidentfitting.get_big_incident_arrival_dist(big_incidents)[source]¶ Get distributions of big incidents over months, days of the week, and hours.
Parameters: big_incidents (pd.DataFrame) – The incident data, filtered to only big incidents / output of get_big_incident_data.
-
incidentfitting.get_big_incident_data(incidents, deployments, types=None, vehicles=['TS'], min_ts=3)[source]¶ Filter incident and deployment data to those instances relating to a ‘big’ incident.
Parameters: - incidents (pd.DataFrame) – The incident data.
- deployments (pd.DataFrame) – The deployment data.
- types (list of strings, default=None) – The incident types to include, if None, use all in the data.
- vehicles (list of strings, default=["TS"]) – The vehicle types to take into account. Deployments of all other vehicle types will be dropped.
- min_ts (int, default=3) – The minimum number of TS deployments for an incident to be included in the result.
Returns: big_incidents, big_deployments – The filtered incident and deployment data (as a tuple).
Return type: tuple of (pd.DataFrame, pd.DataFrame)
-
incidentfitting.get_big_incident_ids(deployments, min_ts=3)[source]¶ Find incidents that have at least a specified number of TS deployments.
Parameters: - deployments (pd.DataFrame) – The deployment data.
- min_ts (int, default=3) – The minimum number of TS deployments for an incident to be classified as big.
Returns: ids – A list of incident IDs that had at least min_ts TS deployments.
Return type: list
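The selection rule can be sketched with a pandas group-by. This is an illustration, not the fdsim implementation: the helper name and the column names `incident_id` and `vehicle_type` are hypothetical stand-ins for whatever the deployment log actually uses.

```python
import pandas as pd


def big_incident_ids(deployments, min_ts=3,
                     id_col="incident_id", vehicle_col="vehicle_type"):
    """Return IDs of incidents with at least `min_ts` TS deployments.

    `id_col` and `vehicle_col` are hypothetical column names.
    """
    # Keep only TS deployments, then count them per incident.
    ts_only = deployments[deployments[vehicle_col] == "TS"]
    counts = ts_only.groupby(id_col).size()
    return counts[counts >= min_ts].index.tolist()


deployments = pd.DataFrame({
    "incident_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "vehicle_type": ["TS", "TS", "TS", "TS", "HV", "TS", "TS", "TS", "RV"],
})
ids = big_incident_ids(deployments, min_ts=3)
```

Incidents 1 and 3 each have three TS deployments and are selected; incident 2 has only one.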
-
incidentfitting.get_big_incident_type_dist(big_incidents, types=None)[source]¶ Get the distribution of big incidents over incident types.
Parameters: - big_incidents (pd.DataFrame) – The incident data, filtered to only big incidents / output of get_big_incident_data.
- types (list of strings, default=None) – The incident types to use. If None, use all.
-
incidentfitting.get_building_function_probabilities(incidents, location_col='hub_vak_bk', locations=None)[source]¶ Find the distribution of building functions per demand location.
Parameters: - incidents (pd.DataFrame) – The log of incidents to obtain building function distributions from.
- location_col (str) – The column name in ‘incidents’ that identifies the demand location.
Returns: A nested dictionary like {‘location id’ -> {‘incident type’ -> {‘building function’ -> probability}}}.
-
incidentfitting.get_overall_building_dist(incidents, location_col='hub_vak_bk')[source]¶ Get aggregated building function distribution for a list of locations.
-
incidentfitting.get_prio_probabilities_per_type(incidents)[source]¶ Create dictionary with the probabilities of having priority 1, 2, and 3 for every incident type.
Parameters: incidents (pd.DataFrame) – Contains the log of incidents from which the probabilities should be obtained.
Returns: Dictionary with incident type names as keys and lists of length 3 as values, where the probabilities of prio 1, 2, and 3 are in positions 0, 1, and 2 respectively.
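The shape of the returned dictionary can be illustrated with a short pandas sketch. It is not the fdsim implementation: the helper name and the priority column name `priority` are assumptions, while ‘dim_incident_incident_type’ is the type column mentioned under infer_types.

```python
import pandas as pd


def prio_probabilities(incidents, type_col="dim_incident_incident_type",
                       prio_col="priority"):
    """Map each incident type to [P(prio 1), P(prio 2), P(prio 3)].

    `prio_col` is a hypothetical column name.
    """
    probs = {}
    for inc_type, group in incidents.groupby(type_col):
        # Relative frequency of each priority within this incident type.
        shares = group[prio_col].value_counts(normalize=True)
        probs[inc_type] = [float(shares.get(p, 0.0)) for p in (1, 2, 3)]
    return probs


incidents = pd.DataFrame({
    "dim_incident_incident_type": ["fire", "fire", "fire", "fire", "alarm", "alarm"],
    "priority": [1, 1, 1, 2, 2, 3],
})
prio_probs = prio_probabilities(incidents)
```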
-
incidentfitting.get_spatial_distribution_per_type(incidents, location_col='hub_vak_bk', locations=None)[source]¶ Obtain the distribution over demand locations for every incident type.
Parameters: - incidents (pd.DataFrame) – The log of incidents to obtain probabilities from.
- location_col (str, default='hub_vak_bk') – The column in ‘incidents’ to use as identifier for demand location.
- locations (list(str), default=None) – The locations that should be present in the result. If None, only incorporates the locations that have had incidents in the past for the concerning incident type.
Returns: Dictionary like {“type”: {“location”: probability}}.
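The effect of the `locations` argument can be sketched as follows. This is an assumption-laden illustration rather than the fdsim code: the helper name is hypothetical, and locations without incidents are simply given probability 0.

```python
import pandas as pd


def spatial_distribution(incidents, location_col="hub_vak_bk",
                         type_col="dim_incident_incident_type", locations=None):
    """Map each incident type to {location: probability}."""
    result = {}
    for inc_type, group in incidents.groupby(type_col):
        shares = group[location_col].value_counts(normalize=True)
        if locations is not None:
            # Force every requested location into the result;
            # locations without incidents of this type get probability 0.
            shares = shares.reindex(locations, fill_value=0.0)
        result[inc_type] = {loc: float(p) for loc, p in shares.items()}
    return result


incidents = pd.DataFrame({
    "dim_incident_incident_type": ["fire", "fire", "fire", "fire", "alarm", "alarm"],
    "hub_vak_bk": ["A", "A", "B", "A", "C", "C"],
})
observed_only = spatial_distribution(incidents)
all_locations = spatial_distribution(incidents, locations=["A", "B", "C"])
```

With `locations=None`, the ‘alarm’ entry only contains location C; with the explicit list, A and B appear with probability 0.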
-
incidentfitting.get_vehicle_requirements_probabilities(incidents, deployments, vehicles)[source]¶ Calculate the probabilities of needing a number of vehicles of a specific type for a specified incident type.
Parameters: - incidents (pd.DataFrame) – The log of incidents to extract probabilities from.
- deployments (pd.DataFrame) – The log of deployments to extract probabilities from.
- vehicles (list) – The vehicle types to take into account.
Returns: Nested dictionary like {“incident type”: {“vehicles”: prob}}.
-
incidentfitting.infer_types(data)[source]¶ Infer incident types from an incident log.
Parameters: data (pd.DataFrame) – The incident data. Must contain the ‘dim_incident_incident_type’ column.
Returns: types – The incident types found in the data.
Return type: list of strings
Notes
Excludes ‘NVT’ and ‘nan’ from the resulting list.
-
incidentfitting.prepare_incidents_for_spatial_analysis(incidents)[source]¶ Perform initial preprocessing tasks before fitting parameters and obtaining probabilities from the incident data.
Parameters: incidents (pd.DataFrame) – The incident data to prepare.
Notes
Some tasks to perform before fitting:
- Remove NaNs in location and building function
- Cast or load location column as int->string
- Remove incidents outside AA
- …
Returns: The prepared DataFrame.
-
responsetimefitting.add_osrm_distance_and_duration(df, osrm_host='http://192.168.56.101:5000')[source]¶ Calculate distance and duration over the road from station to incident for every incident in the data.
Parameters: - df (DataFrame) – The merged data of incidents and deployments. Must contain the following columns: {station_longitude, station_latitude, incident_longitude, incident_latitude}. If not present, call ‘prepare_data_for_response_time_analysis’ first.
- osrm_host (str) – The URL to the OSRM API, defaults to ‘http://192.168.56.101:5000’, which is the default if running OSRM locally.
Returns: The DataFrame with two added columns ‘osrm_distance’ (meters) and ‘osrm_duration’ (seconds).
Notes
Requires OSRM to be installed (an optional dependency of fdsim).
-
responsetimefitting.add_parttime_fulltime_indicator(data, station_col='inzet_kazerne_groep', volunteer_stations=None)[source]¶ Add a column to the data, indicating whether it is a fulltime manned station or a parttime (volunteer) station.
Parameters: - data (pd.DataFrame) – The data to add the column to.
- station_col (str, optional (default: "inzet_kazerne_groep")) – The column indicating the station responsible for the deployment.
- volunteer_stations (array-like of strings, optional) – The station names (all uppercase) of the stations that are parttime.
Returns: data – The data with an added boolean column “fulltime”.
Return type: pd.DataFrame
-
responsetimefitting.fit_big_incident_duration(big_incidents)[source]¶ Fit a Gamma random variable on the duration of big incidents.
Parameters: big_incidents (pd.DataFrame) – The incident data, filtered to only big incidents / output of fdsim.incidentfitting.get_big_incident_data.
Returns: duration_rv – A random variable describing the distribution of big incident durations.
Return type: scipy.stats.gamma frozen distribution
-
responsetimefitting.fit_dispatch_times(data, rough_upper_bound=600)[source]¶ Fit a lognormal random variable to the dispatch time per incident type.
Parameters: - data (DataFrame) – Merged log of deployments and incidents. All deployments in the data will be used for fitting, so any filtering (e.g., on priority or ‘volgnummer’) must be done in advance.
- rough_upper_bound (int) – Number of seconds to use as a rough upper bound filter; dispatch times above this value are considered unrealistic/unreliable and are removed before fitting. Defaults to 600 seconds (10 minutes).
Returns: A dictionary like {‘incident type’ -> ‘scipy.stats.lognorm object’}.
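The filter-then-fit pattern described above might look like the following sketch. It is not the fdsim implementation: the helper name and the column name `dispatch_time` are hypothetical, and fixing the location at zero (`floc=0`) is an assumption made to keep the fit stable.

```python
import numpy as np
import pandas as pd
from scipy import stats


def fit_dispatch_by_type(data, rough_upper_bound=600,
                         type_col="dim_incident_incident_type",
                         time_col="dispatch_time"):
    """Fit one lognormal per incident type; `time_col` is a hypothetical name."""
    # Drop non-positive and implausibly long dispatch times before fitting.
    plausible = data[(data[time_col] > 0) & (data[time_col] <= rough_upper_bound)]
    fitted = {}
    for inc_type, group in plausible.groupby(type_col):
        shape, loc, scale = stats.lognorm.fit(group[time_col].to_numpy(), floc=0)
        fitted[inc_type] = stats.lognorm(shape, loc=loc, scale=scale)
    return fitted


# Synthetic dispatch times in seconds, for illustration only.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "dim_incident_incident_type": ["fire"] * 500 + ["alarm"] * 500,
    "dispatch_time": np.concatenate([
        rng.lognormal(mean=np.log(60), sigma=0.4, size=500),
        rng.lognormal(mean=np.log(90), sigma=0.5, size=500),
    ]),
})
data.loc[0, "dispatch_time"] = 5000.0  # implausible value, removed by the filter
dispatch_rvs = fit_dispatch_by_type(data)
```

Each value in `dispatch_rvs` is a frozen distribution, so `dispatch_rvs["fire"].rvs(size=10)` draws ten simulated dispatch times.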
-
responsetimefitting.fit_gamma_rv(x, **kwargs)[source]¶ Fit a Gamma distribution to data and return a fitted random variable that can be used for sampling.
Parameters: - x (array-like) – The data to fit.
- **kwargs – Additional arguments passed to scipy.stats.gamma.fit().
Returns: The fitted scipy.stats.gamma object.
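A minimal sketch of such a helper, assuming it wraps scipy.stats.gamma.fit and freezes the result (the helper name is hypothetical and the fdsim source may differ in detail):

```python
import numpy as np
from scipy import stats


def fit_gamma(x, **kwargs):
    """Fit a Gamma distribution and return a frozen random variable."""
    shape, loc, scale = stats.gamma.fit(x, **kwargs)
    return stats.gamma(shape, loc=loc, scale=scale)


# Synthetic durations in seconds, for illustration only.
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=300.0, size=5000)

# floc=0 is forwarded to scipy.stats.gamma.fit and pins the location at zero.
rv = fit_gamma(sample, floc=0)
draws = rv.rvs(size=3, random_state=1)
```

The frozen object exposes the usual scipy methods, such as `rv.mean()`, `rv.ppf(0.9)`, and `rv.rvs(size=n)` for sampling.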
-
responsetimefitting.fit_lognorm_rv(x, **kwargs)[source]¶ Fit a lognormal distribution to data and return a fitted random variable that can be used for sampling.
Parameters: - x (array-like) – The data to fit.
- **kwargs – Additional arguments passed to scipy.stats.lognorm.fit().
Returns: The fitted scipy.stats.lognorm object.
-
responsetimefitting.fit_onscene_times(data, vehicles=['TS', 'HV', 'RV', 'WO'], rough_lower_bound=60, rough_upper_bound=86400)[source]¶ Fit a lognormal random variable to the on-scene time per incident type and vehicle type.
Parameters: - data (DataFrame) – Merged log of deployments and incidents. All deployments in the data will be used for fitting, so any filtering (e.g., on priority or ‘volgnummer’) must be done in advance.
- vehicles (array-like of strings) – The vehicles to fit on-scene times for. Optional, defaults to [“TS”, “HV”, “RV”, “WO”].
- rough_lower_bound (int) – Number of seconds to use as a rough lower bound filter, on-scene times below this value are considered unrealistic/unreliable and are removed before fitting. Defaults to 60 seconds.
- rough_upper_bound (int) – Number of seconds to use as a rough upper bound filter, on-scene times above this value are considered unrealistic/unreliable and are removed before fitting. Defaults to 24*60*60 seconds (24 hours).
Returns: A dictionary like {‘incident type’ -> {‘vehicle type’ -> ‘scipy.stats.gamma object’}}.
-
responsetimefitting.fit_simple_linear_regression(data, xcol, ycol, fit_intercept=False)[source]¶ Fit simple linear regression on the data.
Parameters: - data (DataFrame) – The data to fit a model on.
- xcol (str) – The name of the column that acts as a predictor.
- ycol (str) – The name of the column acting as the dependent variable.
- fit_intercept (boolean) – If true, also fits the intercept. If false, forces intercept of the resulting model to 0. NOTE: Defaults to false.
Returns: Parameters of the fitted model, i.e., a tuple of (intercept, coefficient).
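The effect of `fit_intercept` can be made concrete with the closed-form least-squares estimates. The sketch below works on plain lists instead of DataFrame columns and uses a hypothetical helper name; how fdsim performs the fit internally is not stated here.

```python
def fit_line(x, y, fit_intercept=False):
    """Ordinary least squares for y ~ a + b*x; returns (intercept, coefficient)."""
    n = len(x)
    if fit_intercept:
        mean_x = sum(x) / n
        mean_y = sum(y) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        coef = sxy / sxx
        return mean_y - coef * mean_x, coef
    # Regression through the origin: b = sum(x*y) / sum(x^2), intercept forced to 0.
    coef = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    return 0.0, coef


# Illustrative data generated from y = 10 + 1.2 * x.
x = [100, 200, 300, 400]
y = [130, 250, 370, 490]
with_intercept = fit_line(x, y, fit_intercept=True)
through_origin = fit_line(x, y)
```

With the intercept fitted, the model recovers a = 10 and b = 1.2. Forcing the intercept to 0 shifts the slope to 37/30 (about 1.233) to compensate.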
-
responsetimefitting.fit_turnout_times(data, prios=[1, 2, 3], vehicle_types=['TS', 'RV', 'HV', 'WO'], rough_lower_bound=30, rough_upper_bound=600, stations_to_exclude=None, station_col='inzet_kazerne_groep', volunteer_stations=None)[source]¶ Fit a lognormal random variable to the turn-out time per appointment (fulltime/parttime), priority, and vehicle type.
Parameters: - data (DataFrame) – Merged log of deployments and incidents. All deployments in the data will be used for fitting, so any filtering (e.g., on priority or ‘volgnummer’) must be done in advance.
- prios (array-like of int, optional (default: [1, 2, 3])) – The priority levels to fit turnout times for.
- rough_lower_bound, rough_upper_bound (int) – Number of seconds to use as rough lower and upper bound filters; turn-out times outside these values are considered unrealistic/unreliable and are removed before fitting. Default to 30 seconds and 600 seconds (10 minutes), respectively.
- stations_to_exclude (array-like of str, optional (default: None)) – Stations to remove from the data before fitting. Some stations may imply invalid deployments (e.g, “Regio”, “Onbekend”), which could influence the turnout times.
- station_col (str, optional (default: "inzet_kazerne_groep")) – The column in data that holds the station responsible for the deployment.
- volunteer_stations (array-like of str, optional (default: None)) – The names of the stations that are run by volunteers. These are fitted separately from full time stations.
Returns: A dictionary like {‘prio’ -> {‘parttime’ -> ‘scipy.stats.gamma object’, ‘fulltime’ -> ‘scipy.stats.gamma object’}}.
-
responsetimefitting.get_coordinates_locations_stations(data, location_col='hub_vak_bk')[source]¶ Obtain the coordinates of the demand locations and stations.
Parameters: - data (pd.DataFrame) – Merged and preprocessed data (result from ‘prepare_data_for_response_time_analysis’).
- location_col (str, column name of data) – The column to use as identifier for the (demand) location.
- custom_station_locations (array-like of strings) – Identifiers of the locations (in data[location_col]) of the stations in case the custom station locations should be used. If provided, does not use the stations in the data.
Notes
Assumes data has the following columns for coordinates: ‘incident_longitude’, ‘incident_latitude’, ‘station_longitude’, ‘station_latitude’. Data must be in the desired coordinate system already.
Returns: Two dictionaries. The first contains the demand location coordinates, {‘location id’ -> (longitude, latitude)}. The second holds the coordinates of the stations, {‘station name’ -> (longitude, latitude)}. In case of custom station locations, the station name is replaced with an arbitrary identifier.
-
responsetimefitting.get_osrm_distance_and_duration(longlat_origin, longlat_destination, osrm_host='http://192.168.56.101:5000')[source]¶ Calculate distance over the road and normal travel duration from one point to the other.
Parameters: - longlat_origin (tuple(float, float)) – coordinates of the start location in decimal longitude and latitude (in that order).
- longlat_destination (tuple(float, float)) – coordinates of the destination in decimal longitude and latitude.
- osrm_host (str) – The URL to the OSRM API.
Returns: Tuple of (‘distance’, ‘duration’) according to OSRM.
Notes
Requires OSRM to be installed (an optional dependency of fdsim).
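For orientation, the same quantities can be read from OSRM's HTTP route service directly. The sketch below is an alternative illustration using only the standard library, not the fdsim implementation; all helper names are hypothetical, and the `driving` profile is an assumption about how the OSRM server was set up.

```python
import json
from urllib.request import urlopen


def build_route_url(longlat_origin, longlat_destination,
                    osrm_host="http://192.168.56.101:5000"):
    """Build an OSRM route request; coordinates are (longitude, latitude)."""
    coords = "{},{};{},{}".format(*longlat_origin, *longlat_destination)
    return f"{osrm_host}/route/v1/driving/{coords}?overview=false"


def parse_route_response(payload):
    """Extract (distance in meters, duration in seconds) of the first route."""
    route = payload["routes"][0]
    return route["distance"], route["duration"]


def get_distance_and_duration(longlat_origin, longlat_destination,
                              osrm_host="http://192.168.56.101:5000"):
    """Query a running OSRM server (requires network access to `osrm_host`)."""
    url = build_route_url(longlat_origin, longlat_destination, osrm_host)
    with urlopen(url, timeout=10) as response:
        return parse_route_response(json.load(response))


# Offline illustration with a hand-written response in OSRM's JSON layout.
url = build_route_url((4.9, 52.37), (4.85, 52.35))
example = {"code": "Ok", "routes": [{"distance": 1234.5, "duration": 210.0}]}
distance, duration = parse_route_response(example)
```

Note the longitude-first ordering, which matches the `longlat_origin` and `longlat_destination` parameters above.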
-
responsetimefitting.model_noise_travel_time(y, x, a, b)[source]¶ Fit a random variable to the residual of simple linear regression on the travel time.
Parameters: - y (array-like) – The values to predict / simulate.
- x (array-like, same shape as y) – The independent variable that partially explains y.
- a (float) – The intercept of the linear model $y \sim a + b x$.
- b (float) – The coefficient of the linear model $y \sim a + b x$.
Returns: A lognormally distributed random variable (scipy.stats.lognorm) fitted on the residual $y - (a + b x)$.
-
responsetimefitting.model_travel_time(data)[source]¶ Model the travel time as a function of the estimated travel time from OSRM.
Parameters: data (DataFrame) – Output of ‘prepare_data_for_response_time_analysis’.
Returns: Tuple of (intercept, coefficient, residual random variable), where the intercept and coefficient form a linear model predicting the travel time based on the OSRM estimated travel duration, and the random variable is a scipy.stats.lognorm object explaining the residual after prediction. The results can be used to simulate travel times for arbitrary incidents.
-
responsetimefitting.model_travel_time_per_vehicle(data)[source]¶ Model the travel time for every vehicle type separately.
Parameters: data (pd.DataFrame) – Output of ‘prepare_data_for_response_time_analysis’.
Returns: Dictionary like {‘vehicle’ -> {‘a’ -> intercept, ‘b’ -> coefficient, ‘noise_rv’ -> random variable for noise}}. See ‘model_travel_time’ for details on those variables.
-
responsetimefitting.prepare_data_for_response_time_analysis(incidents, deployments, stations, vehicles)[source]¶ Prepare data for fitting dispatch, turnout, and travel times.
Parameters: - incidents (pd.DataFrame) – Contains the log of incidents.
- deployments (pd.DataFrame) – Contains the log of deployments.
- stations (pd.DataFrame) – Contains information on the fire stations.
- vehicles (array-like of strings) – The vehicles to keep in the resulting data. Must correspond to entries in the ‘voertuig_groep’ column in ‘incidents’.
Returns: The merged and preprocessed DataFrame.
-
responsetimefitting.robust_remove_travel_time_outliers(data)[source]¶ Remove outliers in travel time in a robust way by looking at the 50% most reliable data points.
Performs two one-way outlier detection methods. It determines thresholds on average speed and on time per distance unit (the inverse of speed) and cuts the values falling off on the high side. Thresholds are computed as follows:
$$\text{limit} = Q_{0.75} + 1.5\,(Q_{0.75} - Q_{0.25})$$
where $Q_{0.25}$ and $Q_{0.75}$ are the 25% and 75% quantiles. Thus the thresholds depend only on the data points between the 25% and 75% quantiles, which are likely to be reliable points. This makes the method robust against unreliable data.
Parameters: data (DataFrame) – The data to remove outliers from. Assumes the columns ‘osrm_distance’ and ‘inzet_rijtijd’ to be present.
Returns: Tuple of (filtered DataFrame, minimum speed, maximum speed), where speed is in kilometers per hour.
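The two one-sided cuts can be sketched with NumPy as below. This illustrates the threshold rule on plain arrays and is not the fdsim code: the helper names are hypothetical, and the function returns a boolean keep-mask instead of a filtered DataFrame.

```python
import numpy as np


def upper_limit(values):
    """75% quantile plus 1.5 times the interquartile range."""
    q25, q75 = np.percentile(values, [25, 75])
    return q75 + 1.5 * (q75 - q25)


def travel_time_outlier_mask(distance_m, travel_time_s):
    """Return (keep mask, minimum speed, maximum speed); speeds in km/h."""
    distance_m = np.asarray(distance_m, dtype=float)
    travel_time_s = np.asarray(travel_time_s, dtype=float)
    speed = distance_m / travel_time_s * 3.6   # km/h
    pace = travel_time_s / distance_m          # seconds per meter
    max_speed = upper_limit(speed)             # cuts implausibly fast trips
    max_pace = upper_limit(pace)               # cuts implausibly slow trips
    keep = (speed <= max_speed) & (pace <= max_pace)
    min_speed = 3.6 / max_pace                 # pace limit expressed in km/h
    return keep, min_speed, max_speed


# Five plausible trips of 3 km, one far too fast (36 s) and one far too slow (5400 s).
distance = [3000] * 7
travel_time = [300, 270, 240, 216, 200, 36, 5400]
keep, min_speed, max_speed = travel_time_outlier_mask(distance, travel_time)
```

In this example the speed cut removes the 36-second trip and the pace cut removes the 5400-second trip, leaving the five plausible observations.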
-
responsetimefitting.safe_bayes_mvs(x, alpha=0.9)[source]¶ Bayesian confidence intervals for mean and standard deviation.
Parameters: - x (array-like) – The data to compute confidence intervals over
- alpha (float) – The confidence level, defaults to 0.9 (90% confidence)
Returns: Tuple of (lower bound mean, upper bound mean, lower bound std, upper bound std).
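A sketch of how such intervals can be obtained with scipy.stats.bayes_mvs follows. The helper name is hypothetical, and the handling of samples with fewer than two points (returning NaN instead of raising) is an assumption about what “safe” means, not something the documentation above confirms.

```python
import numpy as np
from scipy import stats


def bayes_intervals(x, alpha=0.9):
    """Return (mean lower, mean upper, std lower, std upper)."""
    x = np.asarray(x, dtype=float)
    if x.size < 2:
        # Assumption: bayes_mvs needs at least two points, so fall back to NaN.
        return (np.nan, np.nan, np.nan, np.nan)
    mean_res, _, std_res = stats.bayes_mvs(x, alpha=alpha)
    (mean_lo, mean_hi), (std_lo, std_hi) = mean_res.minmax, std_res.minmax
    return mean_lo, mean_hi, std_lo, std_hi


times = [30, 150, 90, 60, 120]
mean_lo, mean_hi, std_lo, std_hi = bayes_intervals(times, alpha=0.9)
too_small = bayes_intervals([42.0])
```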
-
responsetimefitting.sample_size_sufficient(x, alpha=0.95, max_mean_range=30, max_std_range=25)[source]¶ Determines if sample size is sufficient based on Bayesian confidence intervals and thresholds on the maximum range.
Parameters: - x (array-like) – The data to evaluate.
- alpha (float in range (0,1)) – The confidence level, defaults to 0.95 (95%).
- max_mean_range (float or int) – the maximum range of the confidence interval for the mean to classify the sample size as sufficient.
- max_std_range (float or int) – the maximum range of the confidence interval for the standard deviation to classify the sample size as sufficient.
Returns: True if the widths of the confidence intervals are within the specified thresholds, False otherwise.
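The check can be sketched as a comparison of interval widths against the two thresholds. This is an illustration built on scipy.stats.bayes_mvs with a hypothetical helper name, and it treats samples of fewer than two points as insufficient, which is an assumption.

```python
import numpy as np
from scipy import stats


def sample_size_ok(x, alpha=0.95, max_mean_range=30, max_std_range=25):
    """True if both interval widths stay within the given thresholds."""
    x = np.asarray(x, dtype=float)
    if x.size < 2:
        return False  # assumption: too few points to compute intervals
    mean_res, _, std_res = stats.bayes_mvs(x, alpha=alpha)
    mean_width = mean_res.minmax[1] - mean_res.minmax[0]
    std_width = std_res.minmax[1] - std_res.minmax[0]
    return bool(mean_width <= max_mean_range and std_width <= max_std_range)


small = np.array([30.0, 150.0, 90.0, 60.0, 120.0])
large = np.tile(small, 400)  # same spread, 2000 observations

small_ok = sample_size_ok(small)
large_ok = sample_size_ok(large)
```

Five observations with a standard deviation of roughly 47 seconds give a mean interval far wider than 30 seconds, so the small sample is rejected. Repeating the same values 400 times narrows both intervals enough to pass.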