UKVI Travel History Visualizer
This Python script is designed to automate the tracking and visualization of international travel for UK immigration purposes. It extracts travel data directly from a PDF report (such as a UKVI travel history record), calculates the duration in days for each trip abroad, and generates a clear timeline chart using Matplotlib. The resulting plot provides an at-a-glance overview of all absences, making it easier to monitor compliance with the continuous residence requirements for Indefinite Leave to Remain (ILR) or citizenship applications. The script also supports manual entry for trips not captured in the PDF and allows for custom color-coding of destinations.
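The plotting itself is handled by a display_flights helper from a companion utils module, which is not reproduced on this page. As a rough idea of the approach (a minimal sketch, not the utils implementation), the timeline can be drawn as one horizontal bar per absence from a DataFrame with 'Outbound Date' and 'Inbound Date' columns like the combined-trips table shown further below:

# Minimal sketch (not the utils.display_flights implementation): one
# horizontal bar per absence, spanning its outbound and inbound dates.
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

def plot_absences_sketch(df_trips):
    starts = mdates.date2num(pd.to_datetime(df_trips['Outbound Date']).to_numpy())
    ends = mdates.date2num(pd.to_datetime(df_trips['Inbound Date']).to_numpy())
    fig, ax = plt.subplots(figsize=(10, 2))
    ax.barh(y=[0] * len(starts), width=ends - starts, left=starts, height=0.5)
    ax.xaxis_date()      # format the x axis as calendar dates
    ax.set_yticks([])
    ax.set_xlabel('Date')
    ax.set_title('Days spent outside the UK')
    return fig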
Note
An Optical Character Recognition (OCR) approach was chosen for data extraction. This is because the source PDFs contain tables as non-selectable images (rather than text), which cannot be read by standard text-extraction libraries like pdfplumber.
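If you want to confirm that your own bundle really needs OCR, a quick check with pdfplumber will show whether a page contains any selectable text at all. This is only a sketch; the file path and page index below are placeholders:

# Quick check: does this page have a text layer, or is it an image-only scan?
import pdfplumber

with pdfplumber.open("data/example-bundle.pdf") as pdf:  # placeholder path
    page = pdf.pages[6]                                  # a page with tables
    text = page.extract_text()
    print("Selectable text found" if text else "No text layer -- OCR needed")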
Warning
Pytesseract Requires a Separate Installation
Please be aware that pytesseract is just a Python “wrapper.” It needs the Tesseract-OCR engine to be installed on your system to do the real work of reading text from images. You must install this engine separately:
On Windows, download the installer from the Tesseract at UB Mannheim page.
On macOS, use Homebrew: brew install tesseract.
On Linux, use your package manager: sudo apt install tesseract-ocr.
After installing on Windows, you must explicitly tell pytesseract where to find the executable, as is done in the extract_basic_travel_data function.
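Before running the full extraction, it is worth checking that pytesseract can actually reach the engine. A minimal sanity check (the Windows path below is the default location used by the installer and the same path assumed later in the script):

# Sanity check: confirm the Tesseract engine is reachable from pytesseract
import platform
import pytesseract

if platform.system() == 'Windows':
    # Default install location of the Windows build; adjust if needed
    pytesseract.pytesseract.tesseract_cmd = \
        r'C:\Program Files\Tesseract-OCR\tesseract.exe'

print(pytesseract.get_tesseract_version())  # raises if the engine is missing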

Out:
---> Extracted flight history:
Departure Date/Time Arrival Date/Time Voyage Code In/Out Dep Port Arrival Port Date Only
0 2019-12-02 06:15:00 02/12/2019 07:55 FR5993 Inbound MAD STN 2019-12-02
1 2019-12-04 20:05:00 04/12/2019 23:35 FR5998 Outbound STN MAD 2019-12-04
2 2020-01-13 00:20:00 13/01/2020 07:15 VNOO51 Inbound SGN LHR 2020-01-13
3 2020-09-15 10:00:00 15/09/2020 13:15 FR6035 Outbound STN RMI 2020-09-15
4 2021-02-08 15:35:00 08/02/2021 16:55 IB3162 Inbound MAD LHR 2021-02-08
5 2021-06-17 08:25:00 17/06/2021 11:45 FR5994 Outbound STN MAD 2021-06-17
6 2021-10-05 07:05:00 05/10/2021 08:25 FR5993 Inbound MAD STN 2021-10-05
7 2021-11-19 08:05:00 19/11/2021 11:05 FRO194 Outbound STN BLO 2021-11-19
8 2021-12-02 20:45:00 02/12/2021 22:30 U28550 Inbound RMU LGW 2021-12-02
9 2022-01-20 17:20:00 20/01/2022 20:45 UX1016 Outbound LGW MAD 2022-01-20
10 2022-02-15 12:35:00 15/02/2022 14:05 FR5995 Inbound MAD STN 2022-02-15
11 2022-03-24 09:50:00 24/03/2022 14:30 LS1663 Outbound STN TFS 2022-03-24
12 2022-03-30 14:50:00 30/03/2022 19:05 BY4349 Inbound TFS LGW 2022-03-30
13 2022-04-22 11:40:00 22/04/2022 14:20 FR1886 Outbound STN LIS 2022-04-22
14 2022-04-27 15:10:00 27/04/2022 17:50 FR1887 Inbound LIS STN 2022-04-27
15 2022-06-03 12:55:00 03/06/2022 16:15 FR5996 Outbound STN MAD 2022-06-03
16 2022-06-14 15:35:00 14/06/2022 16:55 FR5995 Inbound MAD STN 2022-06-14
17 2022-07-05 15:35:00 05/07/2022 16:55 FR0124 Outbound STN AOI 2022-07-05
18 2022-08-06 05:50:00 06/08/2022 09:10 W94495 Outbound LTN CDT 2022-08-06
19 2022-09-21 06:45:00 21/09/2022 08:05 FR5993 Inbound MAD STN 2022-09-21
20 2022-10-26 13:05:00 26/10/2022 16:25 FR5996 Outbound STN MAD 2022-10-26
21 2022-11-15 15:50:00 15/11/2022 17:15 IB3166 Inbound MAD LHR 2022-11-15
22 2022-11-16 21:25:00 16/11/2022 18:20 MHOOO1 Outbound LHR KUL 2022-11-16
23 2022-12-26 09:05:00 26/12/2022 15:25 MHO004 Inbound KUL LHR 2022-12-26
24 2023-02-04 06:45:00 04/02/2023 12:30 W94467 Outbound LTN ATH 2023-02-04
25 2023-02-06 14:10:00 06/02/2023 16:10 W95746 Inbound ATH LGW 2023-02-06
26 2023-02-21 16:05:00 21/02/2023 19:15 FR3406 Outbound LTN BLQ 2023-02-21
27 2023-03-01 10:15:00 01/03/2023 11:40 FR2496 Inbound PEG STN 2023-03-01
28 2023-04-28 14:45:00 28/04/2023 19:15 FR2842 Outbound STN LPA 2023-04-28
29 2023-05-09 12:20:00 09/05/2023 16:35 FR2843 Inbound LPA STN 2023-05-09
30 2023-06-30 06:45:00 30/06/2023 09:40 FR2489 Outbound STN OVD 2023-06-30
31 2023-09-04 08:50:00 04/09/2023 11:50 W45785 Outbound LGW MXP 2023-09-04
32 2023-09-11 07:10:00 11/09/2023 08:10 W45786 Inbound MXP LGW 2023-09-11
33 2023-10-04 18:15:00 04/10/2023 21:45 FR5996 Outbound STN MAD 2023-10-04
34 2023-10-17 15:35:00 17/10/2023 17:05 FR5993 Inbound MAD STN 2023-10-17
35 2023-12-15 06:15:00 15/12/2023 09:35 FR2497 Outbound STN PEG 2023-12-15
36 2024-01-10 10:45:00 10/01/2024 12:15 FR2629 Inbound MAD STN 2024-01-10
37 2024-02-26 11:40:00 27/02/2024 04:40 CA0848 Outbound LGW PVG 2024-02-26
38 2024-04-24 16:30:00 24/04/2024 19:55 BA0464 Outbound LHR MAD 2024-04-24
39 2024-05-02 20:50:00 02/05/2024 22:10 BAO465 Inbound MAD LHR 2024-05-02
40 2024-06-26 15:50:00 26/06/2024 19:15 IB3177 Outbound LHR MAD 2024-06-26
41 2024-07-01 18:40:00 01/07/2024 20:00 123718 Inbound MAD LGW 2024-07-01
42 2024-07-20 19:15:00 20/07/2024 22:00 LXO0357 Outbound LHR GVA 2024-07-20
Found errors in the flight sequence: ⚠️
Departure Date/Time Arrival Date/Time Voyage Code In/Out Dep Port Arrival Port Date Only validation_error
17 2022-07-05 15:35:00 05/07/2022 16:55 FR0124 Outbound STN AOI 2022-07-05 Error: Part of consecutive Outbound pair
18 2022-08-06 05:50:00 06/08/2022 09:10 W94495 Outbound LTN CDT 2022-08-06 Error: Part of consecutive Outbound pair
30 2023-06-30 06:45:00 30/06/2023 09:40 FR2489 Outbound STN OVD 2023-06-30 Error: Part of consecutive Outbound pair
31 2023-09-04 08:50:00 04/09/2023 11:50 W45785 Outbound LGW MXP 2023-09-04 Error: Part of consecutive Outbound pair
37 2024-02-26 11:40:00 27/02/2024 04:40 CA0848 Outbound LGW PVG 2024-02-26 Error: Part of consecutive Outbound pair
38 2024-04-24 16:30:00 24/04/2024 19:55 BA0464 Outbound LHR MAD 2024-04-24 Error: Part of consecutive Outbound pair
---> Combined trips:
Outbound Date Inbound Date Outbound Ports Inbound Ports Days Difference Voyage Code
0 2019-12-04 20:05:00 2020-01-13 07:15:00 STN-MAD SGN-LHR 39 FR5998
1 2020-09-15 10:00:00 2021-02-08 16:55:00 STN-RMI MAD-LHR 146 FR6035
2 2021-06-17 08:25:00 2021-10-05 08:25:00 STN-MAD MAD-STN 110 FR5994
3 2021-11-19 08:05:00 2021-12-02 22:30:00 STN-BLO RMU-LGW 13 FRO194
4 2022-01-20 17:20:00 2022-02-15 14:05:00 LGW-MAD MAD-STN 25 UX1016
5 2022-03-24 09:50:00 2022-03-30 19:05:00 STN-TFS TFS-LGW 6 LS1663
6 2022-04-22 11:40:00 2022-04-27 17:50:00 STN-LIS LIS-STN 5 FR1886
7 2022-06-03 12:55:00 2022-06-14 16:55:00 STN-MAD MAD-STN 11 FR5996
8 2022-08-06 05:50:00 2022-09-21 08:05:00 LTN-CDT MAD-STN 46 W94495
9 2022-10-26 13:05:00 2022-11-15 17:15:00 STN-MAD MAD-LHR 20 FR5996
10 2022-11-16 21:25:00 2022-12-26 15:25:00 LHR-KUL KUL-LHR 39 MHOOO1
11 2023-02-04 06:45:00 2023-02-06 16:10:00 LTN-ATH ATH-LGW 2 W94467
12 2023-02-21 16:05:00 2023-03-01 11:40:00 LTN-BLQ PEG-STN 7 FR3406
13 2023-04-28 14:45:00 2023-05-09 16:35:00 STN-LPA LPA-STN 11 FR2842
14 2023-09-04 08:50:00 2023-09-11 08:10:00 LGW-MXP MXP-LGW 6 W45785
15 2023-10-04 18:15:00 2023-10-17 17:05:00 STN-MAD MAD-STN 12 FR5996
16 2023-12-15 06:15:00 2024-01-10 12:15:00 STN-PEG MAD-STN 26 FR2497
17 2024-04-24 16:30:00 2024-05-02 22:10:00 LHR-MAD MAD-LHR 8 BA0464
18 2024-06-26 15:50:00 2024-07-01 20:00:00 LHR-MAD MAD-LGW 5 IB3177
# Libraries
import pandas as pd
import re

from PIL import Image
from pathlib import Path

# Set options to display all rows and columns
pd.set_option('display.max_rows', None)
#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_colwidth', None)  # No truncation of cell content

def pdf2png(pdf_path, out_path, start_page=0, end_page=None, flag_save=True):
    """Converts the pdf pages to images and, optionally, saves them to disk.

    Parameters
    ----------
    pdf_path: Path
        The path to the pdf file.
    out_path: Path
        The folder in which to save the images. If None, a folder
        named after the pdf file is used.
    start_page: int
        The first page containing tables.
    end_page: int
        The last page containing tables.
    flag_save: bool
        Whether to save the images to disk.

    Returns
    -------
    list of PIL.Image
        The converted pages.
    """
    # Libraries
    from pdf2image import convert_from_path

    # Convert pages to images
    images = convert_from_path(pdf_path, first_page=start_page,
                               last_page=end_page, dpi=600)

    # Save images
    if flag_save:
        if out_path is None:
            out_path = pdf_path.with_suffix('')
        out_path.mkdir(parents=True, exist_ok=True)
        for i, image in enumerate(images):
            image.save(out_path / ("page_%02d.png" % (i + start_page)))

    # Return
    return images

def png2json(png_path, out_path):
    """Processes the PNG images to extract flight data and saves
    the result to a JSON file (flights.json).

    Parameters
    ----------
    png_path: pathlib.Path
        The folder containing the PNG images.
    out_path: pathlib.Path
        The folder in which to save the JSON file.
    """
    # Extract all flights
    results = []
    for p in sorted(png_path.glob("*.png")):
        data = extract_basic_travel_data(p)
        results += data
        print(p)
        print(pd.DataFrame(data, columns=headers))
        print('\n\n')

    # Save results as json
    pd.DataFrame(results, columns=headers) \
        .to_json(out_path / 'flights.json',
                 orient="records", indent=4)

def extract_basic_travel_data(image_path):
    """Extract the table information from an image.

    Parameters
    ----------
    image_path: str or Path
        The path to the image.

    Returns
    -------
    list of tuple
        One tuple per extracted table row.
    """
    # Libraries
    import pytesseract
    import platform

    # Tell pytesseract where the Tesseract executable is (Windows)
    if platform.system() == 'Windows':
        pytesseract.pytesseract.tesseract_cmd = \
            r'C:\Program Files\Tesseract-OCR\tesseract.exe'

    # Perform OCR on the image
    image = Image.open(image_path)
    custom_config = r'--psm 6'  # Assume a single uniform block of text (table)
    raw_text = pytesseract.image_to_string(image, config=custom_config)

    # Initialize a list to store the extracted rows
    data = []

    # Regular expression to match each row in the table
    row_pattern = re.compile(
        r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2})\s+"  # Departure Date/Time
        r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2})\s+"  # Arrival Date/Time
        r"([A-Za-z0-9]+)\s+"                   # Voyage Code
        r"(Inbound|Outbound)\s+"               # In/Out
        r"([A-Z]{3})\s+"                       # Dep Port
        r"([A-Z]{3})"                          # Arrival Port
    )

    # Process the raw text line by line
    for line in raw_text.split("\n"):
        match = row_pattern.search(line)
        if match:
            data.append(match.groups())

    # Return
    return data

def perform_validation(df):
    """Validate the chronological Inbound/Outbound flight sequence."""
    MSG1 = 'Error: First flight in log is not Inbound'

    # Create a new column to store error messages, default to 'OK'
    df['validation_error'] = 'OK'

    # Rule 1: Check if the very first flight is Inbound
    if df.iloc[0]['In/Out'] != 'Inbound':
        df.loc[0, 'validation_error'] = MSG1

    # Rule 2: Check for consecutive Inbound/Outbound flights
    # This flags a row if it's the same as the one BEFORE it
    is_same_as_previous = df['In/Out'] == df['In/Out'].shift(1)
    # This flags a row if it's the same as the one AFTER it
    is_same_as_next = df['In/Out'] == df['In/Out'].shift(-1)

    # A row is an error if either of the above conditions is true
    is_consecutive = is_same_as_previous | is_same_as_next

    # Add error messages where the sequence is broken
    df.loc[is_consecutive, 'validation_error'] = \
        'Error: Part of consecutive ' + df['In/Out'] + ' pair'

    # Return
    return df


# ---------------------------------------------------------------
# Main
# ---------------------------------------------------------------
# .. note:: Run these steps the first time only; they can be
#           disabled afterwards, since extracting the pdf pages
#           to png and the subsequent OCR (whose output is saved
#           to a json file) only need to be done once. Please
#           check the json and add any missing flights (whether
#           due to poor extraction, trips that were not registered,
#           or other methods of entry to the country).

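# .. note:: Illustrative sketch only: each record in flights.json (and
#           hence any entry added by hand) follows the headers defined
#           below. The values in this example are placeholders, not a
#           real flight:
#
#               {"Departure Date/Time": "01/01/2023 10:00",
#                "Arrival Date/Time": "01/01/2023 13:00",
#                "Voyage Code": "XX0000", "In/Out": "Outbound",
#                "Dep Port": "AAA", "Arrival Port": "BBB"}
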
# Flags
RUN_PDF2PNG = False  # Extract pdf pages to png images
RUN_OCR = False      # Extract data from png and save to json

# Define the desired headers for the DataFrame
headers = [
    "Departure Date/Time", "Arrival Date/Time",
    "Voyage Code", "In/Out", "Dep Port", "Arrival Port"
]

# Configuration of bundles.
config = {
    '1085721': {  # BH
        'start_page': 6,
        'end_page': 12
    },
    '775243': {  # VQ
        'start_page': 6,
        'end_page': 7
    }
}

# Select the bundle identifier
#id = '775243'
id = '1085721'

# Paths
pdf_path = Path('./data/%s-final-bundle.pdf' % id)
out_path = Path('./outputs/%s' % id)

# Extract images from pdf
if RUN_PDF2PNG:
    pdf2png(pdf_path=pdf_path, out_path=out_path, **config[id])

# Extract all flights
if RUN_OCR:
    png2json(png_path=out_path, out_path=out_path)


# -----------------------------------------------
# Clean and display
# -----------------------------------------------
# Libraries
from utils import display_flights
from utils import combine_outbound_inbound
from utils import MISSING_FLIGHTS
from utils import COLORMAP

# Load DataFrame (as extracted)
df = pd.read_json(out_path / 'flights.json')

# Append missing rows using concat
df_miss = pd.DataFrame(MISSING_FLIGHTS[id])
df = pd.concat([df, df_miss], ignore_index=True)
df = df.drop_duplicates()

# Save results as json
df.to_json(out_path / 'flights.json',
           orient="records", indent=4)

# Ensure 'Departure Date/Time' is in datetime format
df["Departure Date/Time"] = pd.to_datetime(
    df["Departure Date/Time"], format="%d/%m/%Y %H:%M")

# Order chronologically
df = df.sort_values(by='Departure Date/Time').reset_index(drop=True)

# Remove duplicates based on the date (ignoring the hour) and
# keep the first occurrence
df["Date Only"] = df["Departure Date/Time"].dt.date
df = df.drop_duplicates(subset=['Date Only', 'Voyage Code'], keep='first')

# Sort the DataFrame by "Departure Date/Time"
df = df.sort_values(by="Departure Date/Time").reset_index(drop=True)

# Show
print("\n\n---> Extracted flight history:\n\n%s" % df)


# Validate
# --------
# Perform validation
df = perform_validation(df)

# Extract errors
error_df = df[df['validation_error'] != 'OK']

print("\n\n")
if error_df.empty:
    print("All flight sequences are valid! ✅")
else:
    print("Found errors in the flight sequence: ⚠️ \n")
    print(error_df)


# Find the first 'Outbound'
# -------------------------
# .. note:: This step has been made redundant. Its
#           functionality is now incorporated into the
#           combine_outbound_inbound function.

"""
# Find the first 'Outbound' flight and trim the DataFrame.
# We do this because we want to compute time abroad, hence
# each period would be (current outbound - next inbound)
try:
    # Get the index of the first row where 'In/Out' is 'Outbound'
    first_outbound_index = df[df['In/Out'] == 'Outbound'].index[0]
    # Slice the DataFrame to start from that index
    df = df.loc[first_outbound_index:].reset_index(drop=True)
except IndexError:
    pass
"""


# Combine in/out journeys
# -----------------------
# .. note:: For the most accurate results, this function should
#           be run after the flight data JSON has been manually
#           corrected. If the data is inconsistent, it will apply
#           its own assumptions to handle errors (e.g., pairing the
#           last seen 'Outbound' with the next 'Inbound' and
#           ignoring invalid sequences).

# Combine (outbound, inbound) pairs
df_cmb = combine_outbound_inbound(df)

# Save results as json
df_cmb.to_json(out_path / 'roundtrips.json',
               orient="records", indent=4)

# Show
print("\n\n---> Combined trips:\n\n%s" % df_cmb)


# Display and save
# ----------------
import matplotlib.pyplot as plt

display_flights(df_cmb, cmap=COLORMAP)
plt.savefig(out_path / 'graph.jpg')
plt.show()
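
Because the stated aim is monitoring continuous-residence limits, a natural follow-up is to total the days spent abroad in any rolling 12-month window. The sketch below is not part of the script; it builds on the combined trips above ('Outbound Date' and 'Days Difference'), and the 180-day figure is only the commonly cited ILR threshold, which should be checked against current Home Office guidance:

# Sketch (not part of the original script): maximum number of absence days
# in any rolling 12-month window, built from the combined trips.
out_dates = pd.to_datetime(df_cmb['Outbound Date'])
n_days = df_cmb['Days Difference'].astype(int)

absence_days = []
for start, ndays in zip(out_dates, n_days):
    # One entry per full day abroad, starting on the outbound date
    absence_days.extend(pd.date_range(start.normalize(), periods=ndays, freq='D'))

daily = pd.Series(1, index=pd.DatetimeIndex(absence_days)).sort_index()
rolling_max = int(daily.rolling('365D').sum().max())
print("Max days abroad in any 12-month window: %d" % rolling_max)
print("Within the commonly cited 180-day limit:", rolling_max <= 180)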
Total running time of the script: ( 0 minutes 1.118 seconds)