.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "_examples/ukvi-trips/plot_main01.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr__examples_ukvi-trips_plot_main01.py:


UKVI trips visualisation
------------------------

.. GENERATED FROM PYTHON SOURCE LINES 6-285



.. image-sg:: /_examples/ukvi-trips/images/sphx_glr_plot_main01_001.png
   :alt: Voyage Durations (total abroad 535 days)
   :srcset: /_examples/ukvi-trips/images/sphx_glr_plot_main01_001.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

             Outbound Date        Inbound Date Outbound Ports Inbound Ports Days Difference Voyage Code
     0 2018-07-27 11:00:00 2018-07-30 21:25:00        LHR-FRA       FRA-LHR               2      BA0904
     1 2018-08-11 08:45:00 2018-08-25 13:25:00        STN-BLQ       SKG-LGW              13      FR0194
     2 2018-09-29 11:30:00 2018-10-03 16:40:00        LHR-FRA       FRA-LHR               3      LH0905
     3 2018-11-03 07:30:00 2018-11-06 14:45:00        LHR-FRA       FRA-LHR               2      LH0923
     4 2018-11-25 10:40:00 2018-12-01 13:50:00        STN-RAK       RAK-STN               5      FR3556
     5 2018-12-22 11:50:00 2018-12-26 22:40:00        STN-AOI       AOI-STN               3      FR0124
     6 2019-01-04 08:00:00 2019-01-08 11:30:00        STN-FRA       FRA-LHR               3      FR1687
     7 2019-01-21 08:00:00 2019-01-24 18:35:00        STN-FRA       FRA-STN               2      FR1687
     8 2019-02-09 08:05:00 2019-02-15 07:40:00        STN-BLQ       BLQ-STN               4      FR0194
     9 2019-02-20 17:35:00 2019-02-21 21:55:00        LGW-MXP       MXP-LTN               0      U28197
    10 2019-03-01 18:05:00 2019-03-15 06:25:00        LHR-JNB       WDH-JNB              12      SA0235
    11 2019-03-16 18:40:00 2019-03-20 07:40:00        STN-AOI       BLQ-STN               2      FR0124
    12 2019-03-26 08:05:00 2019-03-30 23:55:00        STN-BLQ       AOI-STN               3      FR0194
    13 2019-04-18 09:25:00 2019-04-22 20:21:00        LTN-AMS       AMS-LDN               3      U22157
    14 2019-04-27 06:25:00 2019-04-30 23:30:00        STN-AOI       BLQ-STN               2      FR0124
    15 2019-06-09 06:20:00 2019-06-15 22:00:00        STN-AOI       RMI-STN               5      FR0124
    16 2019-07-11 07:10:00 2019-07-15 13:30:00        LGW-BLQ       BLQ-STN               3      U28989
    17 2019-08-16 07:10:00 2019-08-22 08:05:00        LGW-BLQ       BLQ-STN               5      U28989
    18 2019-08-23 06:25:00 2019-09-02 23:55:00        STN-AOI       BLQ-STN               9      FR0124
    19 2019-11-16 11:35:00 2019-11-19 14:15:00        STN-BGY       BLQ-LTN               2      FR4219
    20 2019-12-22 08:05:00 2020-01-08 19:55:00        STN-BLQ       BLQ-LHR              16      FR0194
    21 2020-02-08 06:05:00 2020-02-10 21:30:00        LGW-ZRH       ZRH-LTN               1      U28113
    22 2020-02-20 09:15:00 2020-02-24 14:15:00        LTN-BLQ       BLQ-LTN               3      FR3406
    23 2020-03-04 06:30:00 2020-03-07 23:15:00        STN-ALC       ALC-STN               2      FR8382
    24 2020-07-26 14:35:00 2020-08-17 20:10:00        LHR-BLQ       BLQ-LHR              21      BA0542
    25 2020-09-06 14:35:00 2021-02-08 11:05:00        LHR-BLQ       FCO-LHR             153      BA0542
    26 2021-06-17 07:00:00 2021-09-25 12:05:00        STN-BLQ       AOI-STN              99      FR0194
    27 2021-11-15 17:25:00 2021-11-29 13:05:00        LTN-BLQ       PEG-STN              12      FR3406
    28 2022-01-04 08:05:00 2022-01-09 22:55:00        STN-BLQ       AOI-STN               4      FR0194
    29 2022-02-10 11:50:00 2022-02-22 07:40:00        STN-AOI       BLQ-STN              10      FR0124
    30 2022-03-24 09:50:00 2022-03-30 19:05:00        STN-TFS       TFS-LGW               5      LS1663
    31 2022-03-31 06:25:00 2022-04-03 11:25:00        STN-AOI       AOI-STN               2      FR0124
    32 2022-04-30 20:10:00 2022-05-02 21:20:00        LHR-BSL       BSL-LHR               1      BA0756
    33 2022-06-03 12:55:00 2022-06-14 12:05:00        STN-MAD       PMI-STN               9      FR5996
    34 2022-07-05 13:05:00 2022-07-27 15:45:00        STN-AOI       BRI-LGW              21      FR0124
    35 2022-09-13 13:05:00 2022-09-21 11:25:00        STN-AOI       AOI-STN               6      FR0124
    36 2022-10-06 16:15:00 2022-10-08 10:15:00        LGW-VRN       VRN-LGW               0      U28449
    37 2022-10-26 13:05:00 2022-12-26 15:25:00        STN-MAD       KUL-LHR              60      FR5996
    38 2023-01-13 06:15:00 2023-01-16 11:55:00        STN-BLQ       AOI-STN               2      FR0194
    39 2023-02-04 06:45:00 2023-02-06 16:10:00        LTN-ATH       ATH-LGW               1      W94467
    40 2023-02-21 16:05:00 2023-03-01 11:40:00        LTN-BLQ       PEG-STN               6      FR3406
    41 2023-04-21 18:35:00 2023-04-24 16:00:00        STN-AOI       PEG-STN               1      FR0261
    42 2023-04-28 14:45:00 2023-05-09 16:35:00        STN-LPA       LPA-STN              10      FR2842
    43 2023-05-28 17:10:00 2023-05-31 22:10:00        LHR-ZRH       ZRH-LGW               2      LX0325
    44 2023-06-09 11:10:00 2023-06-15 20:35:00        LGW-FCO       AOI-STN               5      W45781






|

.. code-block:: default
    :lineno-start: 7


    import pdfplumber
    import re
    import pandas as pd
    import matplotlib.dates as mdates
    import matplotlib.pyplot as plt

    from matplotlib.dates import DateFormatter
    from pathlib import Path


    def extract_basic_travel_data(pdf_path, start_page, end_page):
        """Extract the travel data from a PDF file.

        Ensure that the format is appropriate and that the column
        headers match those included below. Otherwise, modify as
        appropriate.

        Parameters
        ----------
        pdf_path: str
            The path to the file.
        start_page: int
            The start page where the table appears.
        end_page: int
            The end page where the table appears.

        Returns
        -------
        list of dict
            One dictionary per matched row, keyed by header.
        """
        # Define the headers for essential data
        headers = [
            "Departure Date/Time",
            "Arrival Date/Time",
            "Voyage Code",
            "In/Out",
            "Dep Port",
            "Arrival Port"
        ]
        travel_data = []

        # Regex pattern to capture essential information
        row_pattern = re.compile(
            r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2})\s+"  # Departure Date/Time
            r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2})\s+"  # Arrival Date/Time
            r"(\S+)\s+"                             # Voyage Code
            r"(Outbound|Inbound)\s+"                # In/Out
            r"(\S+)\s+"                             # Dep Port
            r"(\S+)"                                # Arrival Port
        )

        # Open the PDF file and iterate over specified pages
        with pdfplumber.open(pdf_path) as pdf:
            for page_num in range(start_page - 1, end_page):
                page = pdf.pages[page_num]
                text = page.extract_text()

                if not text:
                    continue

                # Match rows using the regex pattern
                matches = row_pattern.findall(text)
                if matches:
                    for match in matches:
                        travel_data.append(dict(zip(headers, match)))

        # Return
        return travel_data


    def combine_outbound_inbound(df):
        """Combine each Outbound-Inbound pair of rows into a single row.

        Parameters
        ----------
        df: pd.DataFrame
            The DataFrame with the data.

        Returns
        -------
        pd.DataFrame
        """
        # Convert date columns to datetime format
        df["Departure Date/Time"] = \
            pd.to_datetime(df["Departure Date/Time"], format="%d/%m/%Y %H:%M")
        df["Arrival Date/Time"] = \
            pd.to_datetime(df["Arrival Date/Time"], format="%d/%m/%Y %H:%M")

        # Sort the DataFrame by "Departure Date/Time"
        df = df.sort_values(by="Departure Date/Time").reset_index(drop=True)

        # Process the DataFrame
        result = []
        for i in range(0, len(df) - 1, 2):  # Step by 2 to handle consecutive rows
            outbound = df.iloc[i]
            inbound = df.iloc[i + 1]

            # Ensure the pair consists of an outbound followed by an inbound
            if outbound["In/Out"] == "Outbound" and inbound["In/Out"] == "Inbound":
                # Calculate the difference in days (elapsed 24-hour periods minus one)
                days_difference = (inbound["Arrival Date/Time"] - outbound["Departure Date/Time"]).days - 1

                # Create a combined row with desired columns
                combined_row = {
                    "Outbound Date": outbound["Departure Date/Time"],
                    "Inbound Date": inbound["Arrival Date/Time"],
                    "Outbound Ports": outbound["Dep Port"] + '-' + outbound["Arrival Port"],
                    "Inbound Ports": inbound["Dep Port"] + '-' + inbound["Arrival Port"],
                    "Days Difference": days_difference,
                    "Voyage Code": outbound["Voyage Code"]
                }
                result.append(combined_row)

        # Return
        return pd.DataFrame(result)


    def display(df, cmap=None):
        """Plot the voyages as horizontal bars on a timeline.

        Parameters
        ----------
        df: pd.DataFrame
            The pandas DataFrame.
        cmap: dict, optional
            Mapping from destination airport code to colour.

        Returns
        -------
        None
        """
        # Set up plot
        fig, ax = plt.subplots(figsize=(16, 8))

        # For each row (voyage)
        for i, row in df.iterrows():

            if cmap is None:
                color = 'skyblue'
            else:
                # Colour by destination airport, falling back to the default
                color = cmap.get(row['Outbound Ports'].split('-')[1], 'skyblue')

            # Plot each voyage as a horizontal bar with text annotations
            ax.plot([row["Outbound Date"], row["Inbound Date"]], [i, i],
                    marker='o', color=color, lw=6)

            # Formatting outbound and inbound dates
            outbound_str = row["Outbound Date"].strftime("%d %b")  # Day and abbreviated month
            inbound_str = row["Inbound Date"].strftime("%d %b")    # Day and abbreviated month

            # Adjust the text position to be further right
            ax.text(row["Inbound Date"] + pd.Timedelta(days=10), i - 0.05,  # Increased offset to 10 days
                    f"{row['Outbound Ports']} ({outbound_str}) to {row['Inbound Ports']} ({inbound_str}) | {row['Days Difference']} days",
                    va='center', ha='left', fontsize=9, color="black")

        # Alternate month shading
        start_date = df["Outbound Date"].min().replace(day=1)
        end_date = df["Inbound Date"].max()
        current_date = start_date

        month = 0
        while current_date < end_date:
            next_month = (current_date + pd.DateOffset(months=1)).replace(day=1)
            ax.axvspan(current_date, next_month,
                       color='gray' if month % 2 == 0 else 'lightgray', alpha=0.2)
            current_date = next_month
            month += 1

        # Add vertical lines for each year
        years = pd.date_range(start=start_date, end=end_date+pd.DateOffset(years=1), freq='Y')
        for year in years:
            ax.axvline(year, color='black', linestyle='--', lw=1)  # Vertical line for each year
            ax.text(year - pd.Timedelta(days=90), len(df) + 0.5, year.year,
                    ha='left', va='center', fontsize=10, color='black')  # Year label

        # Setting the x-axis limits to include full years
        full_start_date = pd.Timestamp(year=start_date.year, month=1, day=1)
        full_end_date = pd.Timestamp(year=end_date.year + 1, month=1, day=1)  # Next January
        ax.set_xlim(full_start_date, full_end_date)

        # Set x-axis ticks to show full years from January to December
        ax.xaxis.set_major_locator(mdates.YearLocator())    # Major ticks at the beginning of each year
        ax.xaxis.set_minor_locator(mdates.MonthLocator())   # Minor ticks for each month
        ax.xaxis.set_major_formatter(DateFormatter("%Y"))   # Year as the format for major ticks

        # Formatting the plot
        ax.set_yticks(range(len(df)))
        ax.set_yticklabels(df["Voyage Code"])
        #ax.set_yticklabels(df['Days Difference'])
        ax.set_xlabel("Date")
        ax.set_title("Voyage Durations (total abroad %s days)" % df['Days Difference'].sum())

        # Set x-axis ticks to show abbreviated month names and year
        ax.xaxis.set_major_locator(mdates.MonthLocator())
        ax.xaxis.set_major_formatter(DateFormatter("%b %Y"))  # Month abbreviation and year

        plt.xticks(rotation=45)
        plt.grid(axis='x', linestyle='--', alpha=0.5)
        plt.tight_layout()
        plt.show()


    # ------------------------------------------------------
    # Main
    # ------------------------------------------------------
    # Include any missing entries. This can happen if the travel
    # was done by bus or train, as only flights have been recorded
    # in the system.
    MISSING = {
        'veronica': [
            {
                "Departure Date/Time": "22/04/2019 18:21",
                "Arrival Date/Time": "22/04/2019 20:21",
                "Voyage Code": "BUS001",
                "In/Out": "Inbound",
                "Dep Port": "AMS",
                "Arrival Port": "LDN"
            }
        ]
    }

    # Include the colors desired for each airport. For example,
    # they could be colored by country.
    COLORMAP = {
        'FRA': 'black',
        'BLQ': 'green',
        'LHR': 'blue',
        'LGW': 'blue',
        'RAK': 'skyblue',
        'STN': 'blue',
        'AOI': 'green',
        'MXP': 'green',
        'LTN': 'blue',
        'JNB': 'skyblue',
        'AMS': 'skyblue',
        'BGY': 'green',
        'ZRH': 'skyblue',
        'ALC': 'yellow',
        'TFS': 'yellow',
        'BSL': 'skyblue',
        'MAD': 'yellow',
        'VRN': 'green',
        'ATH': 'skyblue',
        'LPA': 'yellow',
        'FCO': 'green',
        'FRFHN': 'black',
        'LDN': 'blue'
    }

    # Define the PDF file path and page range to extract
    pdf_path = Path('./data/775243 Final Bundle.pdf')
    start_page = 6  # Page number where the tables start
    end_page = 8    # Page number where the tables end

    # Define the JSON file
    #pdf_path = Path('./data/bernard-2024.json')

    # Load the trips from the file
    if pdf_path.suffix == '.pdf':
        trips = extract_basic_travel_data(pdf_path, start_page, end_page)
    elif pdf_path.suffix == '.json':
        trips = pd.read_json(pdf_path)
    else:
        # Stop here: without this, `trips` would be undefined below
        raise ValueError('File extension <%s> not supported.' % pdf_path.suffix)

    # Convert to DataFrame
    df = pd.DataFrame(trips)

    # Append missing rows using concat
    df = pd.concat([df, pd.DataFrame(MISSING['veronica'])], ignore_index=True)

    # Combine consecutive outbound-inbound trips into one row.
    df_cmb = combine_outbound_inbound(df)

    # Show
    print(df_cmb)

    # Display
    display(df_cmb)


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 1.785 seconds)


.. _sphx_glr_download__examples_ukvi-trips_plot_main01.py:

.. only :: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example


  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: plot_main01.py `


  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: plot_main01.ipynb `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_
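
The ``Days Difference`` column in the script is derived from elapsed 24-hour periods (``Timedelta.days - 1``). When the return time of day is earlier than the departure time, this gives one day fewer than a count by calendar date. For example, the trip from 2019-02-09 08:05 to 2019-02-15 07:40 is reported as 4 days. The sketch below shows an alternative that counts the calendar days strictly between departure and arrival. It assumes the intent is to exclude only the departure and arrival days themselves; ``full_days_abroad`` is a hypothetical helper and is not part of the original script.

```python
from datetime import datetime


def full_days_abroad(outbound, inbound):
    """Count the calendar days strictly between two datetimes.

    Hypothetical helper (not in the original script). It assumes that
    the departure day and the arrival day are both excluded.
    """
    # .date() drops the time of day, so only calendar dates are compared
    days_between = (inbound.date() - outbound.date()).days
    # Exclude the arrival day as well; never report a negative count
    return max(days_between - 1, 0)


# Trip taken from row 8 of the output table
out = datetime(2019, 2, 9, 8, 5)
back = datetime(2019, 2, 15, 7, 40)

elapsed = (back - out).days - 1          # approach used in the script
calendar_days = full_days_abroad(out, back)

print(elapsed, calendar_days)  # 4 5
```

Because ``pd.Timestamp`` also provides ``.date()``, the same helper could be applied to the ``Departure Date/Time`` and ``Arrival Date/Time`` values inside ``combine_outbound_inbound``. Note that switching approaches would change the totals shown in the output and in the plot title.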