.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "_examples\ukvi-trips\plot_main03.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr__examples_ukvi-trips_plot_main03.py:


UKVI Travel History Visualizer
==============================

This Python script is designed to automate the tracking and visualization of
international travel for UK immigration purposes. It extracts travel data
directly from a PDF report (such as a UKVI travel history record), calculates
the duration in days for each trip abroad, and generates a clear timeline
chart using Matplotlib. The resulting plot provides an at-a-glance overview of
all absences, making it easier to monitor compliance with the continuous
residence requirements for Indefinite Leave to Remain (ILR) or citizenship
applications. The script also supports manual entry for trips not captured in
the PDF and allows for custom color-coding of destinations.

.. note:: An Optical Character Recognition (OCR) approach was chosen for data
    extraction. This is because the source PDFs contain tables as
    non-selectable images (rather than text), which cannot be read by standard
    text-extraction libraries like pdfplumber.

.. warning:: **Pytesseract Requires a Separate Installation**

    Please be aware that pytesseract is just a Python "wrapper." It needs the
    Tesseract-OCR engine to be installed on your system to do the real work of
    reading text from images. You must install this engine separately:

    - On Windows, download the installer from the Tesseract at UB Mannheim page.
    - On macOS, use Homebrew: ``brew install tesseract``.
    - On Linux, use your package manager: ``sudo apt install tesseract-ocr``.

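Before enabling the OCR step, it may help to confirm that the engine can
actually be found. The following is a minimal sketch that uses only the
standard library. ``find_tesseract`` and ``WINDOWS_DEFAULT`` are illustrative
names (assumptions, not part of this script or of pytesseract), and the
Windows location is simply the default path that the script itself uses.

```python
import shutil
from pathlib import Path

# Assumption: default Windows install location, as used by this script
WINDOWS_DEFAULT = Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe")


def find_tesseract(fallbacks=(WINDOWS_DEFAULT,)):
    """Return the path to the Tesseract executable, or None if not found.

    Hypothetical helper: looks on PATH first, then in the fallback paths.
    """
    on_path = shutil.which("tesseract")
    if on_path:
        return Path(on_path)
    for candidate in fallbacks:
        if Path(candidate).is_file():
            return Path(candidate)
    return None


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract-OCR not found: install it before running the OCR step.")
else:
    print("Using Tesseract at %s" % tesseract_path)
```

If the engine is only found through a fallback path, that path is the value
to assign to ``pytesseract.pytesseract.tesseract_cmd``.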
After installing on Windows, you must explicitly tell pytesseract where to
find the executable, as is done in the ``extract_basic_travel_data`` function.

.. GENERATED FROM PYTHON SOURCE LINES 33-339



.. image-sg:: /_examples/ukvi-trips/images/sphx_glr_plot_main03_001.png
   :alt: Voyage Durations (total abroad 537 days)
   :srcset: /_examples/ukvi-trips/images/sphx_glr_plot_main03_001.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none


    ---> Extracted flight history:

        Departure Date/Time  Arrival Date/Time  Voyage Code    In/Out  Dep Port  Arrival Port   Date Only
    0   2019-12-02 06:15:00   02/12/2019 07:55       FR5993   Inbound       MAD           STN  2019-12-02
    1   2019-12-04 20:05:00   04/12/2019 23:35       FR5998  Outbound       STN           MAD  2019-12-04
    2   2020-01-13 00:20:00   13/01/2020 07:15       VNOO51   Inbound       SGN           LHR  2020-01-13
    3   2020-09-15 10:00:00   15/09/2020 13:15       FR6035  Outbound       STN           RMI  2020-09-15
    4   2021-02-08 15:35:00   08/02/2021 16:55       IB3162   Inbound       MAD           LHR  2021-02-08
    5   2021-06-17 08:25:00   17/06/2021 11:45       FR5994  Outbound       STN           MAD  2021-06-17
    6   2021-10-05 07:05:00   05/10/2021 08:25       FR5993   Inbound       MAD           STN  2021-10-05
    7   2021-11-19 08:05:00   19/11/2021 11:05       FRO194  Outbound       STN           BLO  2021-11-19
    8   2021-12-02 20:45:00   02/12/2021 22:30       U28550   Inbound       RMU           LGW  2021-12-02
    9   2022-01-20 17:20:00   20/01/2022 20:45       UX1016  Outbound       LGW           MAD  2022-01-20
    10  2022-02-15 12:35:00   15/02/2022 14:05       FR5995   Inbound       MAD           STN  2022-02-15
    11  2022-03-24 09:50:00   24/03/2022 14:30       LS1663  Outbound       STN           TFS  2022-03-24
    12  2022-03-30 14:50:00   30/03/2022 19:05       BY4349   Inbound       TFS           LGW  2022-03-30
    13  2022-04-22 11:40:00   22/04/2022 14:20       FR1886  Outbound       STN           LIS  2022-04-22
    14  2022-04-27 15:10:00   27/04/2022 17:50       FR1887   Inbound       LIS           STN  2022-04-27
    15  2022-06-03 12:55:00   03/06/2022 16:15       FR5996  Outbound       STN           MAD  2022-06-03
    16  2022-06-14 15:35:00   14/06/2022 16:55       FR5995   Inbound       MAD           STN  2022-06-14
    17  2022-07-05 15:35:00   05/07/2022 16:55       FR0124  Outbound       STN           AOI  2022-07-05
    18  2022-08-06 05:50:00   06/08/2022 09:10       W94495  Outbound       LTN           CDT  2022-08-06
    19  2022-09-21 06:45:00   21/09/2022 08:05       FR5993   Inbound       MAD           STN  2022-09-21
    20  2022-10-26 13:05:00   26/10/2022 16:25       FR5996  Outbound       STN           MAD  2022-10-26
    21  2022-11-15 15:50:00   15/11/2022 17:15       IB3166   Inbound       MAD           LHR  2022-11-15
    22  2022-11-16 21:25:00   16/11/2022 18:20       MHOOO1  Outbound       LHR           KUL  2022-11-16
    23  2022-12-26 09:05:00   26/12/2022 15:25       MHO004   Inbound       KUL           LHR  2022-12-26
    24  2023-02-04 06:45:00   04/02/2023 12:30       W94467  Outbound       LTN           ATH  2023-02-04
    25  2023-02-06 14:10:00   06/02/2023 16:10       W95746   Inbound       ATH           LGW  2023-02-06
    26  2023-02-21 16:05:00   21/02/2023 19:15       FR3406  Outbound       LTN           BLQ  2023-02-21
    27  2023-03-01 10:15:00   01/03/2023 11:40       FR2496   Inbound       PEG           STN  2023-03-01
    28  2023-04-28 14:45:00   28/04/2023 19:15       FR2842  Outbound       STN           LPA  2023-04-28
    29  2023-05-09 12:20:00   09/05/2023 16:35       FR2843   Inbound       LPA           STN  2023-05-09
    30  2023-06-30 06:45:00   30/06/2023 09:40       FR2489  Outbound       STN           OVD  2023-06-30
    31  2023-09-04 08:50:00   04/09/2023 11:50       W45785  Outbound       LGW           MXP  2023-09-04
    32  2023-09-11 07:10:00   11/09/2023 08:10       W45786   Inbound       MXP           LGW  2023-09-11
    33  2023-10-04 18:15:00   04/10/2023 21:45       FR5996  Outbound       STN           MAD  2023-10-04
    34  2023-10-17 15:35:00   17/10/2023 17:05       FR5993   Inbound       MAD           STN  2023-10-17
    35  2023-12-15 06:15:00   15/12/2023 09:35       FR2497  Outbound       STN           PEG  2023-12-15
    36  2024-01-10 10:45:00   10/01/2024 12:15       FR2629   Inbound       MAD           STN  2024-01-10
    37  2024-02-26 11:40:00   27/02/2024 04:40       CA0848  Outbound       LGW           PVG  2024-02-26
    38  2024-04-24 16:30:00   24/04/2024 19:55       BA0464  Outbound       LHR           MAD  2024-04-24
    39  2024-05-02 20:50:00   02/05/2024 22:10       BAO465   Inbound       MAD           LHR  2024-05-02
    40  2024-06-26 15:50:00   26/06/2024 19:15       IB3177  Outbound       LHR           MAD  2024-06-26
    41  2024-07-01 18:40:00   01/07/2024 20:00       123718   Inbound       MAD           LGW  2024-07-01
    42  2024-07-20 19:15:00   20/07/2024 22:00      LXO0357  Outbound       LHR           GVA  2024-07-20



    Found errors in the flight sequence: ⚠️

        Departure Date/Time  Arrival Date/Time  Voyage Code    In/Out  Dep Port  Arrival Port   Date Only                          validation_error
    17  2022-07-05 15:35:00   05/07/2022 16:55       FR0124  Outbound       STN           AOI  2022-07-05  Error: Part of consecutive Outbound pair
    18  2022-08-06 05:50:00   06/08/2022 09:10       W94495  Outbound       LTN           CDT  2022-08-06  Error: Part of consecutive Outbound pair
    30  2023-06-30 06:45:00   30/06/2023 09:40       FR2489  Outbound       STN           OVD  2023-06-30  Error: Part of consecutive Outbound pair
    31  2023-09-04 08:50:00   04/09/2023 11:50       W45785  Outbound       LGW           MXP  2023-09-04  Error: Part of consecutive Outbound pair
    37  2024-02-26 11:40:00   27/02/2024 04:40       CA0848  Outbound       LGW           PVG  2024-02-26  Error: Part of consecutive Outbound pair
    38  2024-04-24 16:30:00   24/04/2024 19:55       BA0464  Outbound       LHR           MAD  2024-04-24  Error: Part of consecutive Outbound pair


    ---> Combined trips:

              Outbound Date         Inbound Date  Outbound Ports  Inbound Ports  Days Difference  Voyage Code
    0   2019-12-04 20:05:00  2020-01-13 07:15:00         STN-MAD        SGN-LHR               39       FR5998
    1   2020-09-15 10:00:00  2021-02-08 16:55:00         STN-RMI        MAD-LHR              146       FR6035
    2   2021-06-17 08:25:00  2021-10-05 08:25:00         STN-MAD        MAD-STN              110       FR5994
    3   2021-11-19 08:05:00  2021-12-02 22:30:00         STN-BLO        RMU-LGW               13       FRO194
    4   2022-01-20 17:20:00  2022-02-15 14:05:00         LGW-MAD        MAD-STN               25       UX1016
    5   2022-03-24 09:50:00  2022-03-30 19:05:00         STN-TFS        TFS-LGW                6       LS1663
    6   2022-04-22 11:40:00  2022-04-27 17:50:00         STN-LIS        LIS-STN                5       FR1886
    7   2022-06-03 12:55:00  2022-06-14 16:55:00         STN-MAD        MAD-STN               11       FR5996
    8   2022-08-06 05:50:00  2022-09-21 08:05:00         LTN-CDT        MAD-STN               46       W94495
    9   2022-10-26 13:05:00  2022-11-15 17:15:00         STN-MAD        MAD-LHR               20       FR5996
    10  2022-11-16 21:25:00  2022-12-26 15:25:00         LHR-KUL        KUL-LHR               39       MHOOO1
    11  2023-02-04 06:45:00  2023-02-06 16:10:00         LTN-ATH        ATH-LGW                2       W94467
    12  2023-02-21 16:05:00  2023-03-01 11:40:00         LTN-BLQ        PEG-STN                7       FR3406
    13  2023-04-28 14:45:00  2023-05-09 16:35:00         STN-LPA        LPA-STN               11       FR2842
    14  2023-09-04 08:50:00  2023-09-11 08:10:00         LGW-MXP        MXP-LGW                6       W45785
    15  2023-10-04 18:15:00  2023-10-17 17:05:00         STN-MAD        MAD-STN               12       FR5996
    16  2023-12-15 06:15:00  2024-01-10 12:15:00         STN-PEG        MAD-STN               26       FR2497
    17  2024-04-24 16:30:00  2024-05-02 22:10:00         LHR-MAD        MAD-LHR                8       BA0464
    18  2024-06-26 15:50:00  2024-07-01 20:00:00         LHR-MAD        MAD-LGW                5       IB3177





|

.. code-block:: default
    :lineno-start: 34


    # Libraries
    import pandas as pd
    import re

    from PIL import Image
    from pathlib import Path

    # Set options to display all rows and columns
    pd.set_option('display.max_rows', None)
    #pd.set_option('display.max_columns', None)
    #pd.set_option('display.max_colwidth', None)  # No truncation of cell content


    def pdf2png(pdf_path, out_path, start_page=0, end_page=None, flag_save=True):
        """Convert the PDF pages to images and optionally save them as PNG files.

        Parameters
        ----------
        pdf_path: pathlib.Path
            The path for the pdf file.
        out_path: pathlib.Path or None
            The folder where the images are saved. If None, a folder
            named after the pdf file (without its suffix) is used.
        start_page: int
            The start page with tables.
        end_page: int
            The end page with tables.
        flag_save: bool
            Whether to save the images.

        Returns
        -------
        list
            The page images.
        """
        # Libraries
        from pdf2image import convert_from_path

        # Convert images
        images = convert_from_path(pdf_path,
            first_page=start_page, last_page=end_page, dpi=600)

        # Save images
        if flag_save:
            if out_path is None:
                out_path = pdf_path.with_suffix('')
            out_path.mkdir(parents=True, exist_ok=True)
            for i, image in enumerate(images):
                image.save(out_path / ("page_%02d.png" % (i + start_page)))

        # Return
        return images


    def png2json(png_path, out_path):
        """Processes PNG images to extract flight data and saves it to a JSON file.

        .. note:: Relies on the module-level ``headers`` list.

        Parameters
        ----------
        png_path: pathlib.Path
            The folder containing the PNG images.
        out_path: pathlib.Path
            The folder where 'flights.json' is written.

        Returns
        -------
        None
        """
        # Extract all flights
        results = []
        for p in sorted(png_path.glob("*.png")):
            data = extract_basic_travel_data(p)
            results += data
            print(p)
            print(pd.DataFrame(data, columns=headers))
            print('\n\n')

        # Save results as json
        pd.DataFrame(results, columns=headers) \
            .to_json(out_path / 'flights.json', orient="records", indent=4)


    def extract_basic_travel_data(image_path):
        """Extract the table information from an image.

        Parameters
        ----------
        image_path: str or Path
            The path to the image.

        Returns
        -------
        list of tuple
            One tuple per table row matched by the regular expression.
        """
        # Libraries
        import pytesseract
        import platform

        # Tell pytesseract where the Tesseract executable is (Windows)
        if platform.system() == 'Windows':
            pytesseract.pytesseract.tesseract_cmd = \
                r'C:\Program Files\Tesseract-OCR\tesseract.exe'

        # Perform OCR on the image
        image = Image.open(image_path)
        custom_config = r'--psm 6'  # Assume a single uniform block of text
        raw_text = pytesseract.image_to_string(image, config=custom_config)

        # Initialize a list to store the extracted rows
        data = []

        # Regular expression to match each row in the table
        row_pattern = re.compile(
            r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2})\s+"  # Departure Date/Time
            r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2})\s+"  # Arrival Date/Time
            r"([A-Za-z0-9]+)\s+"                   # Voyage Code
            r"(Inbound|Outbound)\s+"               # In/Out
            r"([A-Z]{3})\s+"                       # Dep Port
            r"([A-Z]{3})"                          # Arrival Port
        )

        # Process the raw text line by line
        for line in raw_text.split("\n"):
            match = row_pattern.search(line)
            if match:
                data.append(match.groups())

        # Return
        return data


    def perform_validation(df):
        """Flag rows that break the expected Inbound/Outbound alternation.

        Adds a 'validation_error' column that is 'OK' for valid rows
        and holds an error message otherwise.
        """
        MSG1 = 'Error: First flight in log is not Inbound'

        # Create a new column to store error messages, default to 'OK'
        df['validation_error'] = 'OK'

        # Rule 1: Check if the very first flight is Inbound
        if df.iloc[0]['In/Out'] != 'Inbound':
            df.loc[0, 'validation_error'] = MSG1

        # Rule 2: Check for consecutive Inbound/Outbound flights
        # This flags a row if it's the same as the one BEFORE it
        is_same_as_previous = df['In/Out'] == df['In/Out'].shift(1)
        # This flags a row if it's the same as the one AFTER it
        is_same_as_next = df['In/Out'] == df['In/Out'].shift(-1)
        # A row is an error if either of the above conditions is true
        is_consecutive = is_same_as_previous | is_same_as_next

        # Add error messages where the sequence is broken
        df.loc[is_consecutive, 'validation_error'] = \
            'Error: Part of consecutive ' + df['In/Out'] + ' pair'

        # Return
        return df


    # ---------------------------------------------------------------
    # Main
    # ---------------------------------------------------------------
    # .. note:: Enable both flags on the first run only. Extracting
    #           the pages from pdf to png and the subsequent OCR,
    #           whose results are saved in a json file, only need
    #           to be done once. Please check the json and add any
    #           missing flights (either due to poor extraction, not
    #           registered, or other methods of entry to the country).

    # Flags
    RUN_PDF2PNG = False  # Extract pdf pages to png images
    RUN_OCR = False      # Extract data from png and save to json

    # Define the desired headers for the DataFrame
    headers = [
        "Departure Date/Time",
        "Arrival Date/Time",
        "Voyage Code",
        "In/Out",
        "Dep Port",
        "Arrival Port"
    ]

    # Configuration of bundles.
    config = {
        '1085721': {  # BH
            'start_page': 6,
            'end_page': 12
        },
        '775243': {  # VQ
            'start_page': 6,
            'end_page': 7
        }
    }

    # Select the bundle identifier
    #id = '775243'
    id = '1085721'

    # Path
    pdf_path = Path('./data/%s-final-bundle.pdf' % id)
    out_path = Path('./outputs/%s' % id)

    # Extract images from pdf
    if RUN_PDF2PNG:
        pdf2png(pdf_path=pdf_path, out_path=out_path, **config[id])

    # Extract all flights
    if RUN_OCR:
        png2json(png_path=out_path, out_path=out_path)


    # -----------------------------------------------
    # Clean and display
    # -----------------------------------------------
    # Libraries
    from utils import display_flights
    from utils import combine_outbound_inbound
    from utils import MISSING_FLIGHTS
    from utils import COLORMAP

    # Load DataFrame (as extracted)
    df = pd.read_json(out_path / 'flights.json')

    # Append missing rows using concat
    df_miss = pd.DataFrame(MISSING_FLIGHTS[id])
    df = pd.concat([df, df_miss], ignore_index=True)
    df = df.drop_duplicates()

    # Save results as json
    df.to_json(out_path / 'flights.json', orient="records", indent=4)

    # Ensure 'Departure Date/Time' is in datetime format
    df["Departure Date/Time"] = pd.to_datetime(
        df["Departure Date/Time"], format="%d/%m/%Y %H:%M")

    # Order chronologically
    df = df.sort_values(by='Departure Date/Time').reset_index(drop=True)

    # Remove duplicates based on the date (ignoring hour) and
    # keep the first occurrence
    df["Date Only"] = df["Departure Date/Time"].dt.date
    df = df.drop_duplicates(subset=['Date Only', 'Voyage Code'], keep='first')

    # Sort the DataFrame by "Departure Date/Time"
    df = df.sort_values(by="Departure Date/Time").reset_index(drop=True)

    # Show
    print("\n\n---> Extracted flight history:\n\n%s" % df)


    # Validate
    # --------
    # Perform validation
    df = perform_validation(df)

    # Extract errors
    error_df = df[df['validation_error'] != 'OK']

    print("\n\n")
    if error_df.empty:
        print("All flight sequences are valid! ✅")
    else:
        print("Found errors in the flight sequence: ⚠️ \n")
        print(error_df)


    # Find the first 'Outbound'
    # ------------------------
    # .. note:: This step has been made redundant. Its
    #           functionality is now incorporated into the
    #           combine_outbound_inbound function.
    """
    # Find the first 'Outbound' flight and trim the DataFrame.
    # We do this because we want to compute time abroad, hence
    # each period would be (current outbound - next inbound)
    try:
        # Get the index of the first row where 'In/Out' is 'Outbound'
        first_outbound_index = df[df['In/Out'] == 'Outbound'].index[0]
        # Slice the DataFrame to start from that index
        df = df.loc[first_outbound_index:].reset_index(drop=True)
    except IndexError:
        pass
    """

    # Combine in/out journeys
    # -----------------------
    # .. note:: For the most accurate results, this function should
    #           be run after the flight data JSON has been manually
    #           corrected. If the data is inconsistent, it will apply
    #           its own assumptions to handle errors (e.g., pairing the
    #           last seen 'Outbound' with the next 'Inbound' and
    #           ignoring invalid sequences).

    # Combine (outbound, inbound) pairs
    df_cmb = combine_outbound_inbound(df)

    # Save results as json
    df_cmb.to_json(out_path / 'roundtrips.json', orient="records", indent=4)

    # Show
    print("\n\n---> Combined trips:\n\n%s" % df_cmb)


    # Display and save
    # -----------------------
    import matplotlib.pyplot as plt
    display_flights(df_cmb, cmap=COLORMAP)
    plt.savefig(out_path / 'graph.jpg')
    plt.show()


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  1.118 seconds)


.. _sphx_glr_download__examples_ukvi-trips_plot_main03.py:

.. only :: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example


  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: plot_main03.py `


  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: plot_main03.ipynb `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_
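
The pairing rule described in the "Combine in/out journeys" note (pair the
last seen 'Outbound' with the next 'Inbound' and ignore invalid sequences) can
be sketched as follows. ``combine_outbound_inbound`` lives in ``utils`` and is
not shown in this example, so ``pair_outbound_inbound`` below is a
hypothetical stand-in rather than the actual implementation. It also assumes
that both date columns are already parsed as datetimes and that the trip
length is the whole-day part of arrival minus departure, which is consistent
with the "Days Difference" values printed above.

```python
import pandas as pd


def pair_outbound_inbound(flights):
    """Hypothetical stand-in for utils.combine_outbound_inbound.

    Assumes `flights` has 'Departure Date/Time', 'Arrival Date/Time'
    (both datetimes) and 'In/Out' columns.
    """
    trips = []
    pending = None  # last Outbound flight still waiting for a return
    ordered = flights.sort_values("Departure Date/Time")
    for row in ordered.to_dict("records"):
        if row["In/Out"] == "Outbound":
            # A second Outbound in a row replaces the unmatched one
            pending = row
        elif pending is not None:
            away = row["Arrival Date/Time"] - pending["Departure Date/Time"]
            trips.append({
                "Outbound Date": pending["Departure Date/Time"],
                "Inbound Date": row["Arrival Date/Time"],
                "Days Difference": away.days,  # whole days only
            })
            pending = None
        # An Inbound with no open Outbound is skipped
    return pd.DataFrame(trips)


# Four rows taken from the flight history above (indices 16 to 19)
sample = pd.DataFrame({
    "Departure Date/Time": ["14/06/2022 15:35", "05/07/2022 15:35",
                            "06/08/2022 05:50", "21/09/2022 06:45"],
    "Arrival Date/Time": ["14/06/2022 16:55", "05/07/2022 16:55",
                          "06/08/2022 09:10", "21/09/2022 08:05"],
    "In/Out": ["Inbound", "Outbound", "Outbound", "Inbound"],
})
for col in ("Departure Date/Time", "Arrival Date/Time"):
    sample[col] = pd.to_datetime(sample[col], format="%d/%m/%Y %H:%M")

trips = pair_outbound_inbound(sample)
print(trips)
```

With these four rows, the leading Inbound and the first of the two
consecutive Outbound flights are dropped, leaving a single trip from
2022-08-06 to 2022-09-21 lasting 46 days.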