UKVI Travel History Visualizer

This Python script is designed to automate the tracking and visualization of international travel for UK immigration purposes. It extracts travel data directly from a PDF report (such as a UKVI travel history record), calculates the duration in days for each trip abroad, and generates a clear timeline chart using Matplotlib. The resulting plot provides an at-a-glance overview of all absences, making it easier to monitor compliance with the continuous residence requirements for Indefinite Leave to Remain (ILR) or citizenship applications. The script also supports manual entry for trips not captured in the PDF and allows for custom color-coding of destinations.

Note

An Optical Character Recognition (OCR) approach was chosen for data extraction. This is because the source PDFs contain tables as non-selectable images (rather than text), which cannot be read by standard text-extraction libraries like pdfplumber.
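Once OCR has turned a page image into plain text, each table row can be recovered with a regular expression. The sketch below reuses the row pattern from the script's extract_basic_travel_data function; the sample text is a made-up stand-in for real OCR output, so treat it as an illustration rather than a record of what Tesseract returns.

```python
import re

# Same row pattern as extract_basic_travel_data in the script.
ROW_PATTERN = re.compile(
    r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2})\s+"  # Departure Date/Time
    r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2})\s+"  # Arrival Date/Time
    r"([A-Za-z0-9]+)\s+"                   # Voyage Code
    r"(Inbound|Outbound)\s+"               # In/Out
    r"([A-Z]{3})\s+"                       # Dep Port
    r"([A-Z]{3})"                          # Arrival Port
)

# Hypothetical OCR output: a header line, one table row, and a page footer.
sample_text = (
    "Departure Date/Time Arrival Date/Time Voyage Code In/Out Dep Port Arrival Port\n"
    "02/12/2019 06:15 02/12/2019 07:55 FR5993 Inbound MAD STN\n"
    "Page 6 of 12\n"
)

# Keep only the lines that look like a complete table row.
rows = []
for line in sample_text.splitlines():
    match = ROW_PATTERN.search(line)
    if match:
        rows.append(match.groups())

print(rows)
```

Lines that do not match the full pattern (headers, footers, badly recognised rows) are silently dropped, which is why the extracted JSON should be checked by hand afterwards.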

Warning

Pytesseract Requires a Separate Installation

Please be aware that pytesseract is just a Python “wrapper.” It needs the Tesseract-OCR engine to be installed on your system to do the real work of reading text from images. You must install this engine separately:

  • On Windows, download the installer from the Tesseract at UB Mannheim page.

  • On macOS, use Homebrew: brew install tesseract.

  • On Linux, use your package manager: sudo apt install tesseract-ocr.

After installing on Windows, you must explicitly tell pytesseract where to find the executable, as is done in the extract_basic_travel_data function.
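As a possible alternative to hard-coding the path, the executable can be looked up at runtime. This is a minimal sketch using only the standard library; find_tesseract is a hypothetical helper (it is not part of the script), and the Windows fallback is simply the path the script itself uses.

```python
import platform
import shutil
from pathlib import Path

# Fallback for Windows: the same path the script hard-codes.
WINDOWS_DEFAULT = Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe")


def find_tesseract():
    """Return the path to the tesseract executable as a string, or None."""
    # First, look for the executable on the PATH (typical on macOS and Linux).
    found = shutil.which("tesseract")
    if found:
        return found
    # On Windows the installer may not add it to the PATH, so try the fallback.
    if platform.system() == "Windows" and WINDOWS_DEFAULT.exists():
        return str(WINDOWS_DEFAULT)
    return None


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract not found; install it before running the OCR step.")
else:
    print("Using Tesseract at:", tesseract_path)
```

The resulting string could then be assigned to pytesseract.pytesseract.tesseract_cmd, as the script does on Windows.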

Figure: Voyage Durations (total abroad 537 days)

Out:

---> Extracted flight history:

   Departure Date/Time Arrival Date/Time Voyage Code    In/Out Dep Port Arrival Port   Date Only
0  2019-12-02 06:15:00  02/12/2019 07:55      FR5993   Inbound      MAD          STN  2019-12-02
1  2019-12-04 20:05:00  04/12/2019 23:35      FR5998  Outbound      STN          MAD  2019-12-04
2  2020-01-13 00:20:00  13/01/2020 07:15      VNOO51   Inbound      SGN          LHR  2020-01-13
3  2020-09-15 10:00:00  15/09/2020 13:15      FR6035  Outbound      STN          RMI  2020-09-15
4  2021-02-08 15:35:00  08/02/2021 16:55      IB3162   Inbound      MAD          LHR  2021-02-08
5  2021-06-17 08:25:00  17/06/2021 11:45      FR5994  Outbound      STN          MAD  2021-06-17
6  2021-10-05 07:05:00  05/10/2021 08:25      FR5993   Inbound      MAD          STN  2021-10-05
7  2021-11-19 08:05:00  19/11/2021 11:05      FRO194  Outbound      STN          BLO  2021-11-19
8  2021-12-02 20:45:00  02/12/2021 22:30      U28550   Inbound      RMU          LGW  2021-12-02
9  2022-01-20 17:20:00  20/01/2022 20:45      UX1016  Outbound      LGW          MAD  2022-01-20
10 2022-02-15 12:35:00  15/02/2022 14:05      FR5995   Inbound      MAD          STN  2022-02-15
11 2022-03-24 09:50:00  24/03/2022 14:30      LS1663  Outbound      STN          TFS  2022-03-24
12 2022-03-30 14:50:00  30/03/2022 19:05      BY4349   Inbound      TFS          LGW  2022-03-30
13 2022-04-22 11:40:00  22/04/2022 14:20      FR1886  Outbound      STN          LIS  2022-04-22
14 2022-04-27 15:10:00  27/04/2022 17:50      FR1887   Inbound      LIS          STN  2022-04-27
15 2022-06-03 12:55:00  03/06/2022 16:15      FR5996  Outbound      STN          MAD  2022-06-03
16 2022-06-14 15:35:00  14/06/2022 16:55      FR5995   Inbound      MAD          STN  2022-06-14
17 2022-07-05 15:35:00  05/07/2022 16:55      FR0124  Outbound      STN          AOI  2022-07-05
18 2022-08-06 05:50:00  06/08/2022 09:10      W94495  Outbound      LTN          CDT  2022-08-06
19 2022-09-21 06:45:00  21/09/2022 08:05      FR5993   Inbound      MAD          STN  2022-09-21
20 2022-10-26 13:05:00  26/10/2022 16:25      FR5996  Outbound      STN          MAD  2022-10-26
21 2022-11-15 15:50:00  15/11/2022 17:15      IB3166   Inbound      MAD          LHR  2022-11-15
22 2022-11-16 21:25:00  16/11/2022 18:20      MHOOO1  Outbound      LHR          KUL  2022-11-16
23 2022-12-26 09:05:00  26/12/2022 15:25      MHO004   Inbound      KUL          LHR  2022-12-26
24 2023-02-04 06:45:00  04/02/2023 12:30      W94467  Outbound      LTN          ATH  2023-02-04
25 2023-02-06 14:10:00  06/02/2023 16:10      W95746   Inbound      ATH          LGW  2023-02-06
26 2023-02-21 16:05:00  21/02/2023 19:15      FR3406  Outbound      LTN          BLQ  2023-02-21
27 2023-03-01 10:15:00  01/03/2023 11:40      FR2496   Inbound      PEG          STN  2023-03-01
28 2023-04-28 14:45:00  28/04/2023 19:15      FR2842  Outbound      STN          LPA  2023-04-28
29 2023-05-09 12:20:00  09/05/2023 16:35      FR2843   Inbound      LPA          STN  2023-05-09
30 2023-06-30 06:45:00  30/06/2023 09:40      FR2489  Outbound      STN          OVD  2023-06-30
31 2023-09-04 08:50:00  04/09/2023 11:50      W45785  Outbound      LGW          MXP  2023-09-04
32 2023-09-11 07:10:00  11/09/2023 08:10      W45786   Inbound      MXP          LGW  2023-09-11
33 2023-10-04 18:15:00  04/10/2023 21:45      FR5996  Outbound      STN          MAD  2023-10-04
34 2023-10-17 15:35:00  17/10/2023 17:05      FR5993   Inbound      MAD          STN  2023-10-17
35 2023-12-15 06:15:00  15/12/2023 09:35      FR2497  Outbound      STN          PEG  2023-12-15
36 2024-01-10 10:45:00  10/01/2024 12:15      FR2629   Inbound      MAD          STN  2024-01-10
37 2024-02-26 11:40:00  27/02/2024 04:40      CA0848  Outbound      LGW          PVG  2024-02-26
38 2024-04-24 16:30:00  24/04/2024 19:55      BA0464  Outbound      LHR          MAD  2024-04-24
39 2024-05-02 20:50:00  02/05/2024 22:10      BAO465   Inbound      MAD          LHR  2024-05-02
40 2024-06-26 15:50:00  26/06/2024 19:15      IB3177  Outbound      LHR          MAD  2024-06-26
41 2024-07-01 18:40:00  01/07/2024 20:00      123718   Inbound      MAD          LGW  2024-07-01
42 2024-07-20 19:15:00  20/07/2024 22:00     LXO0357  Outbound      LHR          GVA  2024-07-20



Found errors in the flight sequence: ⚠️

   Departure Date/Time Arrival Date/Time Voyage Code    In/Out Dep Port Arrival Port   Date Only                          validation_error
17 2022-07-05 15:35:00  05/07/2022 16:55      FR0124  Outbound      STN          AOI  2022-07-05  Error: Part of consecutive Outbound pair
18 2022-08-06 05:50:00  06/08/2022 09:10      W94495  Outbound      LTN          CDT  2022-08-06  Error: Part of consecutive Outbound pair
30 2023-06-30 06:45:00  30/06/2023 09:40      FR2489  Outbound      STN          OVD  2023-06-30  Error: Part of consecutive Outbound pair
31 2023-09-04 08:50:00  04/09/2023 11:50      W45785  Outbound      LGW          MXP  2023-09-04  Error: Part of consecutive Outbound pair
37 2024-02-26 11:40:00  27/02/2024 04:40      CA0848  Outbound      LGW          PVG  2024-02-26  Error: Part of consecutive Outbound pair
38 2024-04-24 16:30:00  24/04/2024 19:55      BA0464  Outbound      LHR          MAD  2024-04-24  Error: Part of consecutive Outbound pair
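
The rows flagged above are those whose direction matches the row immediately before or after them. The script implements this rule with pandas shift in perform_validation; the sketch below expresses the same idea in plain Python. flag_consecutive is a hypothetical name, and the sample list reproduces the directions of rows 15 to 19 from the extracted table.

```python
def flag_consecutive(directions):
    """Return the positions whose direction equals that of a neighbour."""
    flagged = []
    for i, direction in enumerate(directions):
        same_as_previous = i > 0 and directions[i - 1] == direction
        same_as_next = i + 1 < len(directions) and directions[i + 1] == direction
        if same_as_previous or same_as_next:
            flagged.append(i)
    return flagged


# Directions of rows 15-19 in the extracted flight history.
directions = ["Outbound", "Inbound", "Outbound", "Outbound", "Inbound"]

print(flag_consecutive(directions))  # [2, 3], i.e. rows 17 and 18
```

A flagged pair usually means that a flight between the two rows is missing from the record and needs to be added by hand.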


---> Combined trips:

         Outbound Date        Inbound Date Outbound Ports Inbound Ports  Days Difference Voyage Code
0  2019-12-04 20:05:00 2020-01-13 07:15:00        STN-MAD       SGN-LHR               39      FR5998
1  2020-09-15 10:00:00 2021-02-08 16:55:00        STN-RMI       MAD-LHR              146      FR6035
2  2021-06-17 08:25:00 2021-10-05 08:25:00        STN-MAD       MAD-STN              110      FR5994
3  2021-11-19 08:05:00 2021-12-02 22:30:00        STN-BLO       RMU-LGW               13      FRO194
4  2022-01-20 17:20:00 2022-02-15 14:05:00        LGW-MAD       MAD-STN               25      UX1016
5  2022-03-24 09:50:00 2022-03-30 19:05:00        STN-TFS       TFS-LGW                6      LS1663
6  2022-04-22 11:40:00 2022-04-27 17:50:00        STN-LIS       LIS-STN                5      FR1886
7  2022-06-03 12:55:00 2022-06-14 16:55:00        STN-MAD       MAD-STN               11      FR5996
8  2022-08-06 05:50:00 2022-09-21 08:05:00        LTN-CDT       MAD-STN               46      W94495
9  2022-10-26 13:05:00 2022-11-15 17:15:00        STN-MAD       MAD-LHR               20      FR5996
10 2022-11-16 21:25:00 2022-12-26 15:25:00        LHR-KUL       KUL-LHR               39      MHOOO1
11 2023-02-04 06:45:00 2023-02-06 16:10:00        LTN-ATH       ATH-LGW                2      W94467
12 2023-02-21 16:05:00 2023-03-01 11:40:00        LTN-BLQ       PEG-STN                7      FR3406
13 2023-04-28 14:45:00 2023-05-09 16:35:00        STN-LPA       LPA-STN               11      FR2842
14 2023-09-04 08:50:00 2023-09-11 08:10:00        LGW-MXP       MXP-LGW                6      W45785
15 2023-10-04 18:15:00 2023-10-17 17:05:00        STN-MAD       MAD-STN               12      FR5996
16 2023-12-15 06:15:00 2024-01-10 12:15:00        STN-PEG       MAD-STN               26      FR2497
17 2024-04-24 16:30:00 2024-05-02 22:10:00        LHR-MAD       MAD-LHR                8      BA0464
18 2024-06-26 15:50:00 2024-07-01 20:00:00        LHR-MAD       MAD-LGW                5      IB3177
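
The combine_outbound_inbound function lives in utils and is not shown on this page, so the following is only a sketch of how such a pairing could work, assuming the behaviour described in the script's comments: the last seen Outbound is paired with the next Inbound, and the duration is the whole number of days between the outbound departure and the inbound arrival. pair_trips is a hypothetical name; the timestamps come from rows 15 to 19 of the extracted table.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (direction, timestamp): departure time for Outbound, arrival time for Inbound.
flights = [
    ("Outbound", "2022-06-03 12:55:00"),
    ("Inbound", "2022-06-14 16:55:00"),
    ("Outbound", "2022-07-05 15:35:00"),  # no matching Inbound in the record
    ("Outbound", "2022-08-06 05:50:00"),
    ("Inbound", "2022-09-21 08:05:00"),
]


def pair_trips(flights):
    """Pair the last seen Outbound with the next Inbound (assumed rule)."""
    trips = []
    last_outbound = None
    for direction, stamp in flights:
        when = datetime.strptime(stamp, FMT)
        if direction == "Outbound":
            # A later Outbound replaces an earlier unmatched one.
            last_outbound = when
        elif last_outbound is not None:
            # timedelta.days truncates to whole days.
            trips.append((last_outbound, when, (when - last_outbound).days))
            last_outbound = None
    return trips


trips = pair_trips(flights)
days = [trip[2] for trip in trips]
total = sum(days)
print(days, total)  # [11, 46] 57
```

These two durations match trips 7 and 8 in the table above. Summing the Days Difference column of the full table gives the 537 days shown in the chart title.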

# Libraries
import pandas as pd
import re

from PIL import Image
from pathlib import Path

# Set options to display all rows and columns
pd.set_option('display.max_rows', None)
#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_colwidth', None) # No truncation of cell content

def pdf2png(pdf_path, out_path, start_page=0, end_page=None, flag_save=True):
    """Converts the PDF pages to images and optionally saves them as PNG files.

    Parameters
    ----------
    pdf_path: pathlib.Path
        The path for the pdf file.
    out_path: pathlib.Path or None
        The folder where the images are saved. If None, a folder
        named after the pdf (without suffix) is used.
    start_page: int
        The start page with tables.
    end_page: int
        The end page with tables.
    flag_save: bool
        Whether to save the images.

    Returns
    -------
    list
        The converted pages as PIL images.
    """
    # Libraries
    from pdf2image import convert_from_path

    # Convert images
    images = convert_from_path(pdf_path, first_page=start_page,
        last_page=end_page, dpi=600)

    # Save images
    if flag_save:
        if out_path is None:
            out_path = pdf_path.with_suffix('')
        out_path.mkdir(parents=True, exist_ok=True)
        for i, image in enumerate(images):
            image.save(out_path / ("page_%02d.png" % (i + start_page)))

    # Return
    return images


def png2json(png_path, out_path):
    """Processes PNG images to extract flight data and saves
    it to a JSON file.

    .. note:: Uses the module-level ``headers`` list defined in
              the Main section below.

    Parameters
    ----------
    png_path: pathlib.Path
        The folder containing the PNG images.
    out_path: pathlib.Path
        The folder where 'flights.json' is saved.

    Returns
    -------
    None
    """
    # Extract all flights
    results = []
    for p in sorted(png_path.glob("*.png")):
        data = extract_basic_travel_data(p)
        results += data
        print(p)
        print(pd.DataFrame(data, columns=headers))
        print('\n\n')

    # Save results as json
    pd.DataFrame(results, columns=headers) \
        .to_json(out_path / 'flights.json',
            orient="records", indent=4)


def extract_basic_travel_data(image_path):
    """Extract the table information from an image.

    Parameters
    ----------
    image_path: str or Path
        The path with the image

    Returns
    -------
    list
        One tuple per matched table row, in the order of ``headers``.
    """
    # Libraries
    import pytesseract
    import platform
    # Tell the Python wrapper where the executable is (Windows)
    if platform.system() == 'Windows':
        pytesseract.pytesseract.tesseract_cmd = \
            r'C:\Program Files\Tesseract-OCR\tesseract.exe'
    # Perform OCR on the image
    image = Image.open(image_path)
    custom_config = r'--psm 6'  # Treat the image as a single uniform block of text
    raw_text = pytesseract.image_to_string(image, config=custom_config)

    # Initialize a list to store the extracted rows
    data = []

    # Regular expression to match each row in the table
    row_pattern = re.compile(
        r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2})\s+"  # Departure Date/Time
        r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2})\s+"  # Arrival Date/Time
        r"([A-Za-z0-9]+)\s+"  # Voyage Code
        r"(Inbound|Outbound)\s+"  # In/Out
        r"([A-Z]{3})\s+"  # Dep Port
        r"([A-Z]{3})"  # Arrival Port
    )

    # Process the raw text line by line
    lines = raw_text.split("\n")
    for line in lines:
        match = row_pattern.search(line)
        if match:
            data.append(match.groups())

    # Return
    return data


def perform_validation(df):
    """Flag rows that break the expected Inbound/Outbound alternation.

    Adds a 'validation_error' column that is 'OK' for valid rows
    and contains an error message otherwise.
    """
    MSG1 = 'Error: First flight in log is not Inbound'

    # Create a new column to store error messages, default to 'OK'
    df['validation_error'] = 'OK'

    # Rule 1: Check if the very first flight is Inbound
    if df.iloc[0]['In/Out'] != 'Inbound':
        df.loc[0, 'validation_error'] = MSG1

    # Rule 2: Check for consecutive Inbound/Outbound flights
    # This flags a row if it's the same as the one BEFORE it
    is_same_as_previous = df['In/Out'] == df['In/Out'].shift(1)
    # This flags a row if it's the same as the one AFTER it
    is_same_as_next = df['In/Out'] == df['In/Out'].shift(-1)

    # A row is an error if either of the above conditions is true
    is_consecutive = is_same_as_previous | is_same_as_next

    # Add error messages where the sequence is broken
    df.loc[is_consecutive, 'validation_error'] = \
        'Error: Part of consecutive ' + df['In/Out'] + ' pair'

    # Return
    return df




# ---------------------------------------------------------------
# Main
# ---------------------------------------------------------------
# .. note:: Run these steps the first time; later they can be
#           disabled, since extracting the pages from pdf to
#           png and the subsequent OCR (whose results are saved
#           in a json file) only need to be done once. Please
#           check the json and add any missing flights (either
#           due to poor extraction, flights not being registered,
#           or other methods of entry to the country).

# Flags
RUN_PDF2PNG = False   # Extract pdf pages to png images
RUN_OCR = False       # Extract data from png and save to json

# Define the desired headers for the DataFrame
headers = [
    "Departure Date/Time", "Arrival Date/Time",
    "Voyage Code", "In/Out", "Dep Port", "Arrival Port"
]

# Configuration of bundles.
config = {
    '1085721': {            # BH
        'start_page': 6,
        'end_page': 12
    },
    '775243': {             # VQ
        'start_page': 6,
        'end_page': 7
    }
}

# Select the bundle identifier
# (named bundle_id to avoid shadowing the built-in id)
#bundle_id = '775243'
bundle_id = '1085721'

# Path
pdf_path = Path('./data/%s-final-bundle.pdf' % bundle_id)
out_path = Path('./outputs/%s' % bundle_id)

# Extract images from pdf
if RUN_PDF2PNG:
    pdf2png(pdf_path=pdf_path, out_path=out_path, **config[bundle_id])

# Extract all flights
if RUN_OCR:
    png2json(png_path=out_path, out_path=out_path)


# -----------------------------------------------
# Clean and display
# -----------------------------------------------
# Libraries
from utils import display_flights
from utils import combine_outbound_inbound
from utils import MISSING_FLIGHTS
from utils import COLORMAP

# Load DataFrame (as extracted)
df = pd.read_json(out_path / 'flights.json')

# Append missing rows using concat
df_miss = pd.DataFrame(MISSING_FLIGHTS[bundle_id])
df = pd.concat([df, df_miss], ignore_index=True)
df = df.drop_duplicates()

# Save results as json
df.to_json(out_path / 'flights.json',
    orient="records", indent=4)

# Ensure 'Departure Date/Time' is in datetime format
df["Departure Date/Time"] = pd.to_datetime(
    df["Departure Date/Time"], format="%d/%m/%Y %H:%M")

# Order chronologically
df = df.sort_values(by='Departure Date/Time').reset_index(drop=True)

# Remove duplicates based on the date (ignoring the time) and
# the voyage code, keeping the first occurrence
df["Date Only"] = df["Departure Date/Time"].dt.date
df = df.drop_duplicates(subset=['Date Only', 'Voyage Code'], keep='first')

# Re-sort by "Departure Date/Time" and reset the index
df = df.sort_values(by="Departure Date/Time").reset_index(drop=True)

# Show
print("\n\n---> Extracted flight history:\n\n%s" % df)


# Validate
# --------
# Perform validation
df = perform_validation(df)

# Extract errors
error_df = df[df['validation_error'] != 'OK']

print("\n\n")
if error_df.empty:
    print("All flight sequences are valid! ✅")
else:
    print("Found errors in the flight sequence: ⚠️ \n")
    print(error_df)



# Find the first 'Outbound'
# ------------------------
#  .. note:: This step has been made redundant. Its
#            functionality is now incorporated into the
#            combine_outbound_inbound function.

"""
# Find the first 'Outbound' flight and trim the DataFrame.
# We do this because we want to compute time abroad, hence
# each period would be (current outbound - next inbound)
try:
    # Get the index of the first row where 'In/Out' is 'Outbound'
    first_outbound_index = df[df['In/Out'] == 'Outbound'].index[0]
    # Slice the DataFrame to start from that index
    df = df.loc[first_outbound_index:].reset_index(drop=True)
except IndexError:
    pass
"""



# Combine in/out journeys
# -----------------------
# .. note:: For the most accurate results, this function should
#           be run after the flight data JSON has been manually
#           corrected. If the data is inconsistent, it will apply
#           its own assumptions to handle errors (e.g., pairing the
#           last seen 'Outbound' with the next 'Inbound' and
#           ignoring invalid sequences).

# Combine (outbound, inbound) pairs
df_cmb = combine_outbound_inbound(df)

# Save results as json
df_cmb.to_json(out_path / 'roundtrips.json',
    orient="records", indent=4)

# Show
print("\n\n---> Combined trips:\n\n%s" % df_cmb)


# Display and save
# -----------------------
import matplotlib.pyplot as plt
display_flights(df_cmb, cmap=COLORMAP)
plt.savefig(out_path / 'graph.jpg')
plt.show()

Total running time of the script: ( 0 minutes 1.118 seconds)
