ChatRTX is a demo app that lets you personalize a GPT large language model (LLM) connected to your own content—docs, notes, images, or other data. Leveraging retrieval-augmented generation (RAG), TensorRT-LLM, and RTX acceleration, you can query a custom chatbot to quickly get contextually relevant answers. Because it all runs locally on your Windows RTX PC or workstation, you get fast and secure results.
Running 70B Llama 3 models on a PC: a 70B model uses approximately 140 GB of RAM at full precision (each parameter is a 2-byte floating-point number). If you want to run at full precision, it can be done with llama.cpp and a Mac that has 192 GB of unified memory, though the speed will not be great (maybe a couple of tokens per second). If you run with 8-bit quantization, the RAM requirement drops by half and speed also improves. You can build a PC with two used RTX 3090s, which gives you 48 GB of VRAM and very good speed when running the 4-bit quantized version.
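As a back-of-the-envelope check of those numbers, here is a minimal Python sketch; the bytes-per-parameter figures are approximations and ignore runtime overhead such as the KV cache and activation buffers:

```python
# Rough memory estimate for loading an LLM's weights at different precisions.
# Real runtimes (llama.cpp, TensorRT-LLM) add overhead for the KV cache,
# activations, and runtime buffers, so treat these as lower bounds.

PARAMS_70B = 70e9  # 70 billion parameters

BYTES_PER_PARAM = {
    "fp16 (full precision)": 2.0,   # 2-byte floats, as described above
    "int8 (8-bit quant)":    1.0,
    "int4 (4-bit quant)":    0.5,
}

for precision, bytes_per_param in BYTES_PER_PARAM.items():
    gb = PARAMS_70B * bytes_per_param / 1e9
    print(f"{precision:24s} ~{gb:6.0f} GB")

# Approximate output:
#   fp16 (full precision)    ~   140 GB
#   int8 (8-bit quant)       ~    70 GB
#   int4 (4-bit quant)       ~    35 GB  -> fits in 48 GB across two RTX 3090s
```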
There is a video that discusses more specifics about using Nvidia 3090 or 4070 GPUs for large local models. There are also people using Nvidia A100s that have been modified in China. The Nvidia A100s are more expensive and power hungry.
Fine-tuning with consumer hardware is viable, but you will probably need to rent a couple of A100s from a cloud provider.
Nvidia has described a proper current AI inference setup as one that needs to process 10 tokens per second.
The most developer-friendly method of local LLM inference that requires between 48 GB and 150 GB of VRAM is a single Apple M2 chip. The cost is between US$5-10K for Apple Silicon, which gives you roughly 50 GB of usable VRAM on a 64 GB machine up to about 150 GB on a 192 GB machine. The cost for two A6000s is similar at around $12K for 96 GB of VRAM. Smaller models like 7B can also run okay on a base Lenovo P1 Gen 6 with an RTX 3500 Ada or a MacBook Pro M3 Max.
Nvidia has new drivers for improving the performance of local LLMs: ONNX Runtime (ORT) and DirectML get a speedup from the new NVIDIA R555 Game Ready Driver. ORT and DirectML are high-performance tools used to run AI models locally on Windows PCs.
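As a rough illustration of what that stack looks like from code, here is a minimal sketch of running an ONNX model through ORT with the DirectML execution provider. It assumes the onnxruntime-directml package is installed, and `model.onnx` is a placeholder path rather than a file from the driver release:

```python
import numpy as np
import onnxruntime as ort  # pip install onnxruntime-directml

# Ask ORT to run on the GPU via DirectML, falling back to CPU if needed.
session = ort.InferenceSession(
    "model.onnx",  # placeholder path to an exported ONNX model
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)

# Inspect the model's expected input and feed it a dummy tensor.
input_meta = session.get_inputs()[0]
print("input name:", input_meta.name, "shape:", input_meta.shape)

# Replace dynamic dimensions with 1 just to build a dummy input.
dummy = np.zeros(
    [d if isinstance(d, int) else 1 for d in input_meta.shape],
    dtype=np.float32,
)
outputs = session.run(None, {input_meta.name: dummy})
print("output shape:", outputs[0].shape)
```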
Here is another local AI setup. This step-by-step tutorial covers installing Ollama, deploying a feature-rich web UI, and integrating Stable Diffusion for image generation. Learn to customize AI models, manage user access, and even add AI capabilities to your note-taking app. Whether you are a tech enthusiast or looking to enhance your workflow, this video provides the knowledge to harness the power of AI on your local machine (a minimal example of calling Ollama's local API follows the parts list below).
➡️Lian Li Case: https://geni.us/B9dtwB7
➡️Motherboard – ASUS X670E-CREATOR PROART WIFI: https://geni.us/SLonv
➡️CPU – AMD Ryzen 9 7950X3D Raphael AM5 4.2GHz 16-Core: https://geni.us/UZOZ5
➡️Power Supply – Corsair AX1600i 1600 Watt 80 Plus Titanium: https://geni.us/O1toG
➡️CPU AIO – Lian Li Galahad II LCD-SL Infinity 360mm Water Cooling Kit: https://geni.us/uBgF
➡️Storage – Samsung 990 PRO 2TB: https://geni.us/hQ5c
➡️RAM – G.Skill Trident Z5 Neo RGB 64GB (2 x 32GB): https://geni.us/D2sUN
➡️GPU – MSI GeForce RTX 4090 SUPRIM LIQUID X 24G Hybrid Cooling 24GB: https://geni.us/G5BZ
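With Ollama installed as in the tutorial above, any local script can query it over its default HTTP endpoint. A minimal sketch, assuming Ollama is serving on its default port 11434 and that a llama3 model has already been pulled:

```python
import json
import urllib.request

# Ollama's local REST endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"

payload = {
    "model": "llama3",   # assumes `ollama pull llama3` was already run
    "prompt": "Summarize retrieval-augmented generation in two sentences.",
    "stream": False,     # return a single JSON object instead of a stream
}

req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

print(body["response"])
```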
RAG – Using Your Own Data With Llama 3
Local agentic RAG with Llama 3 for 10 times the performance with your local data.
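A minimal local RAG loop along those lines works like this: embed your documents once, embed the question, retrieve the closest chunks by cosine similarity, and stuff them into the prompt for Llama 3. The sketch below assumes Ollama's /api/embeddings and /api/generate endpoints and a local embedding model such as nomic-embed-text; the toy chunks and top-k value are purely illustrative:

```python
import json
import urllib.request
import numpy as np

BASE = "http://localhost:11434"

def post(path, payload):
    """Small helper to POST JSON to the local Ollama server."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def embed(text):
    # Assumes an embedding model was pulled, e.g. `ollama pull nomic-embed-text`.
    out = post("/api/embeddings", {"model": "nomic-embed-text", "prompt": text})
    return np.array(out["embedding"])

# 1. Embed your own documents once (here: a few toy chunks).
chunks = [
    "Our Q3 report shows revenue grew 12% year over year.",
    "The deployment guide requires CUDA 12 and 24 GB of VRAM.",
    "Vacation requests must be filed two weeks in advance.",
]
chunk_vecs = np.stack([embed(c) for c in chunks])

# 2. Embed the question and retrieve the top-k closest chunks.
question = "How much VRAM does the deployment need?"
q = embed(question)
sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
top_k = [chunks[i] for i in np.argsort(sims)[::-1][:2]]

# 3. Let Llama 3 answer using only the retrieved context.
prompt = ("Answer using only this context:\n" + "\n".join(top_k)
          + f"\n\nQuestion: {question}")
answer = post("/api/generate", {"model": "llama3", "prompt": prompt, "stream": False})
print(answer["response"])
```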
Phones Are Not Good for Running Local LLMs – but Small LLM Models and Hardware Adaptations Will Make This More Viable
Here are the processing power and battery capacity figures for some iPhones.
iPhone 8 / 8 Plus / X (2017): Uses the A11 Bionic chip, estimated at around 600 GFLOPS.
iPhone XS / XS Max / XR (2018): Uses the A12 Bionic chip, estimated at around 1.2 TFLOPS.
iPhone 11 / 11 Pro / 11 Pro Max / SE (2019): Uses the A13 Bionic chip, estimated at around 1.8 TFLOPS.
iPhone 12 / 12 Mini / 12 Pro / 12 Pro Max (2020): Uses the A14 Bionic chip, estimated at around 2.5 TFLOPS.
iPhone 13 / 14 (2021-2022): Uses the A15 Bionic chip, estimated at around 3.3-3.6 TFLOPS.
iPhone 15 Pro / 15 Pro Max (2023): Uses the A17 Pro chip, estimated at around 4.0 TFLOPS.
iPhone 12 Pro (2020): 10.78 Wh (2815 mAh, 3.83V)
iPhone 12 Pro Max (2020): 14.13 Wh (3687 mAh, 3.83V)
iPhone 13 (2021): 10-16.75 Wh
iPhone SE (2022): 7.82 Wh (2018 mAh, 3.88V)
iPhone 14 (2022): 10-15 Wh (3279 mAh, 3.85V)
iPhone 15 (2023): 13-17 Wh (3349 mAh, 3.87V)
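To see why phones struggle, here is a back-of-the-envelope sketch using the chip and battery figures above; the model size, utilization factor, and sustained power draw are illustrative assumptions, not measurements:

```python
# Back-of-envelope feasibility check for on-phone LLM inference.
# ASSUMPTIONS (illustrative, not measurements): a 7B-parameter model quantized
# to 4 bits, ~2 FLOPs per parameter per generated token, 15% sustained
# utilization of the A17 Pro's ~4 TFLOPS, and ~5 W sustained power draw.

PARAMS          = 7e9        # 7B model
BYTES_PER_PARAM = 0.5        # 4-bit quantization
PEAK_TFLOPS     = 4.0        # A17 Pro estimate from the list above
UTILIZATION     = 0.15       # assumed sustained fraction of peak compute
BATTERY_WH      = 13.0       # low end of the iPhone 15 battery figures above
POWER_W         = 5.0        # assumed sustained draw during inference

weights_gb     = PARAMS * BYTES_PER_PARAM / 1e9
flops_per_tok  = 2 * PARAMS
tokens_per_sec = PEAK_TFLOPS * 1e12 * UTILIZATION / flops_per_tok
runtime_hours  = BATTERY_WH / POWER_W

print(f"weights in memory : ~{weights_gb:.1f} GB")
print(f"compute-bound rate: ~{tokens_per_sec:.0f} tokens/sec "
      f"(memory bandwidth is usually the real limit)")
print(f"battery runtime   : ~{runtime_hours:.1f} hours at {POWER_W:.0f} W sustained")
```

Under these assumptions a 7B model's weights alone take ~3.5 GB of a phone's limited RAM, and sustained generation would drain the battery in roughly two and a half hours, which is why smaller models and hardware adaptations matter.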

Brian Wang is a Futurist Thought Leader and a popular Science blogger with 1 million readers per month. His blog Nextbigfuture.com is ranked #1 Science News Blog. It covers many disruptive technology and trends including Space, Robotics, Artificial Intelligence, Medicine, Anti-aging Biotechnology, and Nanotechnology.
Known for identifying cutting edge technologies, he is currently a Co-Founder of a startup and fundraiser for high potential early-stage companies. He is the Head of Research for Allocations for deep technology investments and an Angel Investor at Space Angels.
A frequent speaker at corporations, he has been a TEDx speaker, a Singularity University speaker, and a guest on numerous radio shows and podcasts. He is open to public speaking and advising engagements.
Hello Brian, maybe you missed this: https://www.macrumors.com/2024/05/09/apple-to-power-ai-features-with-m2-ultra-servers/
Apple is using their own chips to build AI servers.