UKVI Travel History Visualizer
This Python script is designed to automate the tracking and visualization of international travel for UK immigration purposes. It extracts travel data directly from a PDF report (such as a UKVI travel history record), calculates the duration in days for each trip abroad, and generates a clear timeline chart using Matplotlib. The resulting plot provides an at-a-glance overview of all absences, making it easier to monitor compliance with the continuous residence requirements for Indefinite Leave to Remain (ILR) or citizenship applications. The script also supports manual entry for trips not captured in the PDF and allows for custom color-coding of destinations.
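The plotting itself is handled by a display_flights helper from a companion utils module, which is not reproduced on this page. As a rough idea of the approach (a minimal sketch, not the utils implementation), the timeline can be drawn as one horizontal bar per absence from a DataFrame with 'Outbound Date' and 'Inbound Date' columns like the combined-trips table shown further below:

# Minimal sketch (not the utils.display_flights implementation): one
# horizontal bar per absence, spanning its outbound and inbound dates.
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

def plot_absences_sketch(df_trips):
    starts = mdates.date2num(pd.to_datetime(df_trips['Outbound Date']).to_numpy())
    ends = mdates.date2num(pd.to_datetime(df_trips['Inbound Date']).to_numpy())
    fig, ax = plt.subplots(figsize=(10, 2))
    ax.barh(y=[0] * len(starts), width=ends - starts, left=starts, height=0.5)
    ax.xaxis_date()      # format the x axis as calendar dates
    ax.set_yticks([])
    ax.set_xlabel('Date')
    ax.set_title('Days spent outside the UK')
    return fig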
Note
An Optical Character Recognition (OCR) approach was chosen for data extraction. This is because the source PDFs contain tables as non-selectable images (rather than text), which cannot be read by standard text-extraction libraries like pdfplumber.
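If you want to confirm that your own bundle really needs OCR, a quick check with pdfplumber will show whether a page contains any selectable text at all. This is only a sketch; the file path and page index below are placeholders:

# Quick check: does this page have a text layer, or is it an image-only scan?
import pdfplumber

with pdfplumber.open("data/example-bundle.pdf") as pdf:  # placeholder path
    page = pdf.pages[6]                                  # a page with tables
    text = page.extract_text()
    print("Selectable text found" if text else "No text layer -- OCR needed")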
Warning
Pytesseract Requires a Separate Installation
Please be aware that pytesseract is just a Python “wrapper.” It needs the Tesseract-OCR engine to be installed on your system to do the real work of reading text from images. You must install this engine separately:
On Windows, download the installer from the Tesseract at UB Mannheim page.
On macOS, use Homebrew: brew install tesseract.
On Linux, use your package manager: sudo apt install tesseract-ocr.
After installing on Windows, you must explicitly tell pytesseract where to find the executable, as is done in the extract_basic_travel_data function.
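Before running the full extraction, it is worth checking that pytesseract can actually reach the engine. A minimal sanity check (the Windows path below is the default location used by the installer and the same path assumed later in the script):

# Sanity check: confirm the Tesseract engine is reachable from pytesseract
import platform
import pytesseract

if platform.system() == 'Windows':
    # Default install location of the Windows build; adjust if needed
    pytesseract.pytesseract.tesseract_cmd = \
        r'C:\Program Files\Tesseract-OCR\tesseract.exe'

print(pytesseract.get_tesseract_version())  # raises if the engine is missing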

Out:
---> Extracted flight history:
Departure Date/Time Arrival Date/Time Voyage Code In/Out Dep Port Arrival Port Date Only
0 2019-12-02 06:15:00 02/12/2019 07:55 FR5993 Inbound MAD STN 2019-12-02
1 2019-12-04 20:05:00 04/12/2019 23:35 FR5998 Outbound STN MAD 2019-12-04
2 2020-01-13 00:20:00 13/01/2020 07:15 VNOO51 Inbound SGN LHR 2020-01-13
3 2020-09-15 10:00:00 15/09/2020 13:15 FR6035 Outbound STN RMI 2020-09-15
4 2021-02-08 15:35:00 08/02/2021 16:55 IB3162 Inbound MAD LHR 2021-02-08
5 2021-06-17 08:25:00 17/06/2021 11:45 FR5994 Outbound STN MAD 2021-06-17
6 2021-10-05 07:05:00 05/10/2021 08:25 FR5993 Inbound MAD STN 2021-10-05
7 2021-11-19 08:05:00 19/11/2021 11:05 FRO194 Outbound STN BLO 2021-11-19
8 2021-12-02 20:45:00 02/12/2021 22:30 U28550 Inbound RMU LGW 2021-12-02
9 2022-01-20 17:20:00 20/01/2022 20:45 UX1016 Outbound LGW MAD 2022-01-20
10 2022-02-15 12:35:00 15/02/2022 14:05 FR5995 Inbound MAD STN 2022-02-15
11 2022-03-24 09:50:00 24/03/2022 14:30 LS1663 Outbound STN TFS 2022-03-24
12 2022-03-30 14:50:00 30/03/2022 19:05 BY4349 Inbound TFS LGW 2022-03-30
13 2022-04-22 11:40:00 22/04/2022 14:20 FR1886 Outbound STN LIS 2022-04-22
14 2022-04-27 15:10:00 27/04/2022 17:50 FR1887 Inbound LIS STN 2022-04-27
15 2022-06-03 12:55:00 03/06/2022 16:15 FR5996 Outbound STN MAD 2022-06-03
16 2022-06-14 15:35:00 14/06/2022 16:55 FR5995 Inbound MAD STN 2022-06-14
17 2022-07-05 15:35:00 05/07/2022 16:55 FR0124 Outbound STN AOI 2022-07-05
18 2022-08-06 05:50:00 06/08/2022 09:10 W94495 Outbound LTN CDT 2022-08-06
19 2022-09-21 06:45:00 21/09/2022 08:05 FR5993 Inbound MAD STN 2022-09-21
20 2022-10-26 13:05:00 26/10/2022 16:25 FR5996 Outbound STN MAD 2022-10-26
21 2022-11-15 15:50:00 15/11/2022 17:15 IB3166 Inbound MAD LHR 2022-11-15
22 2022-11-16 21:25:00 16/11/2022 18:20 MHOOO1 Outbound LHR KUL 2022-11-16
23 2022-12-26 09:05:00 26/12/2022 15:25 MHO004 Inbound KUL LHR 2022-12-26
24 2023-02-04 06:45:00 04/02/2023 12:30 W94467 Outbound LTN ATH 2023-02-04
25 2023-02-06 14:10:00 06/02/2023 16:10 W95746 Inbound ATH LGW 2023-02-06
26 2023-02-21 16:05:00 21/02/2023 19:15 FR3406 Outbound LTN BLQ 2023-02-21
27 2023-03-01 10:15:00 01/03/2023 11:40 FR2496 Inbound PEG STN 2023-03-01
28 2023-04-28 14:45:00 28/04/2023 19:15 FR2842 Outbound STN LPA 2023-04-28
29 2023-05-09 12:20:00 09/05/2023 16:35 FR2843 Inbound LPA STN 2023-05-09
30 2023-06-30 06:45:00 30/06/2023 09:40 FR2489 Outbound STN OVD 2023-06-30
31 2023-09-04 08:50:00 04/09/2023 11:50 W45785 Outbound LGW MXP 2023-09-04
32 2023-09-11 07:10:00 11/09/2023 08:10 W45786 Inbound MXP LGW 2023-09-11
33 2023-10-04 18:15:00 04/10/2023 21:45 FR5996 Outbound STN MAD 2023-10-04
34 2023-10-17 15:35:00 17/10/2023 17:05 FR5993 Inbound MAD STN 2023-10-17
35 2023-12-15 06:15:00 15/12/2023 09:35 FR2497 Outbound STN PEG 2023-12-15
36 2024-01-10 10:45:00 10/01/2024 12:15 FR2629 Inbound MAD STN 2024-01-10
37 2024-02-26 11:40:00 27/02/2024 04:40 CA0848 Outbound LGW PVG 2024-02-26
38 2024-04-24 16:30:00 24/04/2024 19:55 BA0464 Outbound LHR MAD 2024-04-24
39 2024-05-02 20:50:00 02/05/2024 22:10 BAO465 Inbound MAD LHR 2024-05-02
40 2024-06-26 15:50:00 26/06/2024 19:15 IB3177 Outbound LHR MAD 2024-06-26
41 2024-07-01 18:40:00 01/07/2024 20:00 123718 Inbound MAD LGW 2024-07-01
42 2024-07-20 19:15:00 20/07/2024 22:00 LXO0357 Outbound LHR GVA 2024-07-20
Found errors in the flight sequence: ⚠️
Departure Date/Time Arrival Date/Time Voyage Code In/Out Dep Port Arrival Port Date Only validation_error
17 2022-07-05 15:35:00 05/07/2022 16:55 FR0124 Outbound STN AOI 2022-07-05 Error: Part of consecutive Outbound pair
18 2022-08-06 05:50:00 06/08/2022 09:10 W94495 Outbound LTN CDT 2022-08-06 Error: Part of consecutive Outbound pair
30 2023-06-30 06:45:00 30/06/2023 09:40 FR2489 Outbound STN OVD 2023-06-30 Error: Part of consecutive Outbound pair
31 2023-09-04 08:50:00 04/09/2023 11:50 W45785 Outbound LGW MXP 2023-09-04 Error: Part of consecutive Outbound pair
37 2024-02-26 11:40:00 27/02/2024 04:40 CA0848 Outbound LGW PVG 2024-02-26 Error: Part of consecutive Outbound pair
38 2024-04-24 16:30:00 24/04/2024 19:55 BA0464 Outbound LHR MAD 2024-04-24 Error: Part of consecutive Outbound pair
---> Combined trips:
Outbound Date Inbound Date Outbound Ports Inbound Ports Days Difference Voyage Code
0 2019-12-04 20:05:00 2020-01-13 07:15:00 STN-MAD SGN-LHR 39 FR5998
1 2020-09-15 10:00:00 2021-02-08 16:55:00 STN-RMI MAD-LHR 146 FR6035
2 2021-06-17 08:25:00 2021-10-05 08:25:00 STN-MAD MAD-STN 110 FR5994
3 2021-11-19 08:05:00 2021-12-02 22:30:00 STN-BLO RMU-LGW 13 FRO194
4 2022-01-20 17:20:00 2022-02-15 14:05:00 LGW-MAD MAD-STN 25 UX1016
5 2022-03-24 09:50:00 2022-03-30 19:05:00 STN-TFS TFS-LGW 6 LS1663
6 2022-04-22 11:40:00 2022-04-27 17:50:00 STN-LIS LIS-STN 5 FR1886
7 2022-06-03 12:55:00 2022-06-14 16:55:00 STN-MAD MAD-STN 11 FR5996
8 2022-08-06 05:50:00 2022-09-21 08:05:00 LTN-CDT MAD-STN 46 W94495
9 2022-10-26 13:05:00 2022-11-15 17:15:00 STN-MAD MAD-LHR 20 FR5996
10 2022-11-16 21:25:00 2022-12-26 15:25:00 LHR-KUL KUL-LHR 39 MHOOO1
11 2023-02-04 06:45:00 2023-02-06 16:10:00 LTN-ATH ATH-LGW 2 W94467
12 2023-02-21 16:05:00 2023-03-01 11:40:00 LTN-BLQ PEG-STN 7 FR3406
13 2023-04-28 14:45:00 2023-05-09 16:35:00 STN-LPA LPA-STN 11 FR2842
14 2023-09-04 08:50:00 2023-09-11 08:10:00 LGW-MXP MXP-LGW 6 W45785
15 2023-10-04 18:15:00 2023-10-17 17:05:00 STN-MAD MAD-STN 12 FR5996
16 2023-12-15 06:15:00 2024-01-10 12:15:00 STN-PEG MAD-STN 26 FR2497
17 2024-04-24 16:30:00 2024-05-02 22:10:00 LHR-MAD MAD-LHR 8 BA0464
18 2024-06-26 15:50:00 2024-07-01 20:00:00 LHR-MAD MAD-LGW 5 IB3177
# Libraries
import pandas as pd
import re

from PIL import Image
from pathlib import Path

# Set options to display all rows and columns
pd.set_option('display.max_rows', None)
#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_colwidth', None)  # No truncation of cell content

def pdf2png(pdf_path, out_path, start_page=0, end_page=None, flag_save=True):
    """Converts the pdf pages to images and, optionally, saves them to disk.

    Parameters
    ----------
    pdf_path: Path
        The path to the pdf file.
    out_path: Path
        The folder in which to save the images. If None, a folder
        named after the pdf file is used.
    start_page: int
        The first page containing tables.
    end_page: int
        The last page containing tables.
    flag_save: bool
        Whether to save the images to disk.

    Returns
    -------
    list of PIL.Image
        The converted pages.
    """
    # Libraries
    from pdf2image import convert_from_path

    # Convert pages to images
    images = convert_from_path(pdf_path, first_page=start_page,
                               last_page=end_page, dpi=600)

    # Save images
    if flag_save:
        if out_path is None:
            out_path = pdf_path.with_suffix('')
        out_path.mkdir(parents=True, exist_ok=True)
        for i, image in enumerate(images):
            image.save(out_path / ("page_%02d.png" % (i + start_page)))

    # Return
    return images

def png2json(png_path, out_path):
    """Processes the PNG images to extract flight data and saves
    the result to a JSON file (flights.json).

    Parameters
    ----------
    png_path: pathlib.Path
        The folder containing the PNG images.
    out_path: pathlib.Path
        The folder in which to save the JSON file.
    """
    # Extract all flights
    results = []
    for p in sorted(png_path.glob("*.png")):
        data = extract_basic_travel_data(p)
        results += data
        print(p)
        print(pd.DataFrame(data, columns=headers))
        print('\n\n')

    # Save results as json
    pd.DataFrame(results, columns=headers) \
        .to_json(out_path / 'flights.json',
                 orient="records", indent=4)

def extract_basic_travel_data(image_path):
    """Extract the table information from an image.

    Parameters
    ----------
    image_path: str or Path
        The path to the image.

    Returns
    -------
    list of tuple
        One tuple per extracted table row.
    """
    # Libraries
    import pytesseract
    import platform

    # Tell pytesseract where the Tesseract executable is (Windows)
    if platform.system() == 'Windows':
        pytesseract.pytesseract.tesseract_cmd = \
            r'C:\Program Files\Tesseract-OCR\tesseract.exe'

    # Perform OCR on the image
    image = Image.open(image_path)
    custom_config = r'--psm 6'  # Assume a single uniform block of text (table)
    raw_text = pytesseract.image_to_string(image, config=custom_config)

    # Initialize a list to store the extracted rows
    data = []

    # Regular expression to match each row in the table
    row_pattern = re.compile(
        r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2})\s+"  # Departure Date/Time
        r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2})\s+"  # Arrival Date/Time
        r"([A-Za-z0-9]+)\s+"                   # Voyage Code
        r"(Inbound|Outbound)\s+"               # In/Out
        r"([A-Z]{3})\s+"                       # Dep Port
        r"([A-Z]{3})"                          # Arrival Port
    )

    # Process the raw text line by line
    for line in raw_text.split("\n"):
        match = row_pattern.search(line)
        if match:
            data.append(match.groups())

    # Return
    return data

def perform_validation(df):
    """Validate the chronological Inbound/Outbound flight sequence."""
    MSG1 = 'Error: First flight in log is not Inbound'

    # Create a new column to store error messages, default to 'OK'
    df['validation_error'] = 'OK'

    # Rule 1: Check if the very first flight is Inbound
    if df.iloc[0]['In/Out'] != 'Inbound':
        df.loc[0, 'validation_error'] = MSG1

    # Rule 2: Check for consecutive Inbound/Outbound flights
    # This flags a row if it's the same as the one BEFORE it
    is_same_as_previous = df['In/Out'] == df['In/Out'].shift(1)
    # This flags a row if it's the same as the one AFTER it
    is_same_as_next = df['In/Out'] == df['In/Out'].shift(-1)

    # A row is an error if either of the above conditions is true
    is_consecutive = is_same_as_previous | is_same_as_next

    # Add error messages where the sequence is broken
    df.loc[is_consecutive, 'validation_error'] = \
        'Error: Part of consecutive ' + df['In/Out'] + ' pair'

    # Return
    return df


# ---------------------------------------------------------------
# Main
# ---------------------------------------------------------------
# .. note:: Run these steps the first time only; they can be
#           disabled afterwards, since extracting the pdf pages
#           to png and the subsequent OCR (whose output is saved
#           to a json file) only need to be done once. Please
#           check the json and add any missing flights (whether
#           due to poor extraction, trips that were not registered,
#           or other methods of entry to the country).

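# .. note:: Illustrative sketch only: each record in flights.json (and
#           hence any entry added by hand) follows the headers defined
#           below. The values in this example are placeholders, not a
#           real flight:
#
#               {"Departure Date/Time": "01/01/2023 10:00",
#                "Arrival Date/Time": "01/01/2023 13:00",
#                "Voyage Code": "XX0000", "In/Out": "Outbound",
#                "Dep Port": "AAA", "Arrival Port": "BBB"}
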
# Flags
RUN_PDF2PNG = False  # Extract pdf pages to png images
RUN_OCR = False      # Extract data from png and save to json

# Define the desired headers for the DataFrame
headers = [
    "Departure Date/Time", "Arrival Date/Time",
    "Voyage Code", "In/Out", "Dep Port", "Arrival Port"
]

# Configuration of bundles.
config = {
    '1085721': {  # BH
        'start_page': 6,
        'end_page': 12
    },
    '775243': {  # VQ
        'start_page': 6,
        'end_page': 7
    }
}

# Select the bundle identifier
#id = '775243'
id = '1085721'

# Paths
pdf_path = Path('./data/%s-final-bundle.pdf' % id)
out_path = Path('./outputs/%s' % id)

# Extract images from pdf
if RUN_PDF2PNG:
    pdf2png(pdf_path=pdf_path, out_path=out_path, **config[id])

# Extract all flights
if RUN_OCR:
    png2json(png_path=out_path, out_path=out_path)


# -----------------------------------------------
# Clean and display
# -----------------------------------------------
# Libraries
from utils import display_flights
from utils import combine_outbound_inbound
from utils import MISSING_FLIGHTS
from utils import COLORMAP

# Load DataFrame (as extracted)
df = pd.read_json(out_path / 'flights.json')

# Append missing rows using concat
df_miss = pd.DataFrame(MISSING_FLIGHTS[id])
df = pd.concat([df, df_miss], ignore_index=True)
df = df.drop_duplicates()

# Save results as json
df.to_json(out_path / 'flights.json',
           orient="records", indent=4)

# Ensure 'Departure Date/Time' is in datetime format
df["Departure Date/Time"] = pd.to_datetime(
    df["Departure Date/Time"], format="%d/%m/%Y %H:%M")

# Order chronologically
df = df.sort_values(by='Departure Date/Time').reset_index(drop=True)

# Remove duplicates based on the date (ignoring the hour) and
# keep the first occurrence
df["Date Only"] = df["Departure Date/Time"].dt.date
df = df.drop_duplicates(subset=['Date Only', 'Voyage Code'], keep='first')

# Sort the DataFrame by "Departure Date/Time"
df = df.sort_values(by="Departure Date/Time").reset_index(drop=True)

# Show
print("\n\n---> Extracted flight history:\n\n%s" % df)


# Validate
# --------
# Perform validation
df = perform_validation(df)

# Extract errors
error_df = df[df['validation_error'] != 'OK']

print("\n\n")
if error_df.empty:
    print("All flight sequences are valid! ✅")
else:
    print("Found errors in the flight sequence: ⚠️ \n")
    print(error_df)


# Find the first 'Outbound'
# -------------------------
# .. note:: This step has been made redundant. Its
#           functionality is now incorporated into the
#           combine_outbound_inbound function.

"""
# Find the first 'Outbound' flight and trim the DataFrame.
# We do this because we want to compute time abroad, hence
# each period would be (current outbound - next inbound)
try:
    # Get the index of the first row where 'In/Out' is 'Outbound'
    first_outbound_index = df[df['In/Out'] == 'Outbound'].index[0]
    # Slice the DataFrame to start from that index
    df = df.loc[first_outbound_index:].reset_index(drop=True)
except IndexError:
    pass
"""


# Combine in/out journeys
# -----------------------
# .. note:: For the most accurate results, this function should
#           be run after the flight data JSON has been manually
#           corrected. If the data is inconsistent, it will apply
#           its own assumptions to handle errors (e.g., pairing the
#           last seen 'Outbound' with the next 'Inbound' and
#           ignoring invalid sequences).

# Combine (outbound, inbound) pairs
df_cmb = combine_outbound_inbound(df)

# Save results as json
df_cmb.to_json(out_path / 'roundtrips.json',
               orient="records", indent=4)

# Show
print("\n\n---> Combined trips:\n\n%s" % df_cmb)


# Display and save
# ----------------
import matplotlib.pyplot as plt

display_flights(df_cmb, cmap=COLORMAP)
plt.savefig(out_path / 'graph.jpg')
plt.show()
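
Because the stated aim is monitoring continuous-residence limits, a natural follow-up is to total the days spent abroad in any rolling 12-month window. The sketch below is not part of the script; it builds on the combined trips above ('Outbound Date' and 'Days Difference'), and the 180-day figure is only the commonly cited ILR threshold, which should be checked against current Home Office guidance:

# Sketch (not part of the original script): maximum number of absence days
# in any rolling 12-month window, built from the combined trips.
out_dates = pd.to_datetime(df_cmb['Outbound Date'])
n_days = df_cmb['Days Difference'].astype(int)

absence_days = []
for start, ndays in zip(out_dates, n_days):
    # One entry per full day abroad, starting on the outbound date
    absence_days.extend(pd.date_range(start.normalize(), periods=ndays, freq='D'))

daily = pd.Series(1, index=pd.DatetimeIndex(absence_days)).sort_index()
rolling_max = int(daily.rolling('365D').sum().max())
print("Max days abroad in any 12-month window: %d" % rolling_max)
print("Within the commonly cited 180-day limit:", rolling_max <= 180)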
Total running time of the script: ( 0 minutes 1.118 seconds)